    Scriptivox API Documentation

    API Reference

    Complete reference for the Scriptivox transcription API. All endpoints require an API key, passed in one of:

    • Authorization: sk_live_… (the Bearer prefix is accepted but not required)
    • X-Api-Key: sk_live_…

    Header names are case-insensitive. Each account can have at most 5 active API keys at a time — revoke an unused key in the dashboard before creating a new one if you hit this ceiling.

    Base URL: https://api.scriptivox.com/v1

    Service status: real-time uptime + incident history at status.scriptivox.com.


    Transcribe

    Send audio for transcription. You can either pass a URL (we download it) or upload your own file.

    From a URL

    The simplest path — one POST request. We download the file, validate it, and start transcription automatically. Supports Google Drive, Dropbox, and OneDrive sharing links.

    POST /v1/transcribe

    Start a transcription from a public URL. The file is downloaded and validated in the background. Poll GET /v1/transcribe/{id} for status updates. Duration and cost are determined after download.

    Authentication: Authorization: sk_live_YOUR_KEY

    Request Body

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | url | string | Required | Public URL to an audio/video file (http or https). Supports Google Drive, Dropbox, and OneDrive sharing links. Max 2048 characters. |
    | language | string | Optional | ISO 639-1 language code (e.g. "en", "es", "fr"). Omit for auto-detection. Warning: forcing a wrong language may produce a translation instead of a transcription. |
    | diarize | boolean | Optional | Enable speaker diarization to identify who said what. Default: false. |
    | speaker_count | integer | Optional | Expected number of speakers (1–50). Requires diarize to be true. Strongly recommended when you know the speaker count, because providing it noticeably improves diarization accuracy. If omitted, the model auto-detects, which can over- or under-segment speakers. |
    | align | boolean | Optional | Enable word-level timestamps with start/end times for every word, plus per-word confidence scores when the alignment model supports them. Default: true. Confidence is language-dependent: some languages return null. Note: when diarize is true, alignment is automatically enabled (required for speaker assignment), even if you pass align: false. |
    | webhook_url | string | Optional | URL to receive completion/failure webhook. HTTPS recommended. |

    Response

    json
    {
      "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "status": "created",
      "message": "Transcription created. The file will be downloaded and processed. Poll GET /v1/transcribe/{id} for status updates."
    }
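The documented limits on the request body can be checked client-side before the request is sent. The sketch below is one way to do that; build_transcribe_body is a hypothetical helper written for this page, not part of any Scriptivox SDK, and it only enforces the rules stated in the parameter descriptions above.

```python
from urllib.parse import urlparse


def build_transcribe_body(url, language=None, diarize=False,
                          speaker_count=None, align=True, webhook_url=None):
    """Validate arguments against the documented limits and return the JSON body."""
    if len(url) > 2048:
        raise ValueError("url must be at most 2048 characters")
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url must use http or https")
    if speaker_count is not None:
        if not diarize:
            raise ValueError("speaker_count requires diarize=True")
        if not 1 <= speaker_count <= 50:
            raise ValueError("speaker_count must be between 1 and 50")

    body = {"url": url, "diarize": diarize, "align": align}
    # Optional fields are left out when unset (an assumption of this sketch).
    if language is not None:
        body["language"] = language
    if speaker_count is not None:
        body["speaker_count"] = speaker_count
    if webhook_url is not None:
        body["webhook_url"] = webhook_url
    return body
```

The returned dict can be passed as the json argument of the POST request. Rejecting a bad speaker_count locally avoids a round trip that would end in 400 INVALID_REQUEST.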

    Code Examples

    import time

    import requests

    HEADERS = {"Authorization": "sk_live_YOUR_KEY"}

    # Submit the job; the file is downloaded and validated in the background.
    resp = requests.post(
        "https://api.scriptivox.com/v1/transcribe",
        headers=HEADERS,
        json={
            "url": "https://example.com/podcast-episode.mp3",
            "diarize": True,
            "language": "en",
        },
    )
    job = resp.json()

    # Poll every 5 seconds until the job reaches a terminal status.
    while True:
        result = requests.get(
            f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
            headers=HEADERS,
        ).json()
        if result["status"] in ("completed", "failed"):
            break
        time.sleep(5)

    From a file upload

    Upload your own file when you need full control or don't have a public URL.

    POST /v1/upload

    Get a presigned URL to upload an audio or video file. The URL expires in 1 hour. Upload your file to the returned URL with a PUT request, then pass the upload_id to POST /v1/transcribe.

    Authentication: Authorization: sk_live_YOUR_KEY

    Request Body

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | filename | string | Required | Name of the file being uploaded (e.g. "meeting.mp3"). Maximum 255 characters. |

    Response

    json
    {
      "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "upload_url": "https://storage.supabase.co/...",
      "expires_in": 3600,
      "method": "PUT",
      "headers": {
        "Content-Type": "audio/mpeg"
      }
    }

    Code Examples

    import requests

    # Step 1: request a presigned upload URL (valid for 1 hour).
    resp = requests.post(
        "https://api.scriptivox.com/v1/upload",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        json={"filename": "meeting.mp3"},
    )
    upload = resp.json()

    # Step 2: PUT the file to the presigned URL with the headers returned above.
    with open("meeting.mp3", "rb") as f:
        requests.put(upload["upload_url"], headers=upload["headers"], data=f)

    POST /v1/transcribe

    Start a transcription from an uploaded file. Pass the upload_id from POST /v1/upload. The file is validated in the background. Poll GET /v1/transcribe/{id} for status updates. Duration and cost are determined after validation.

    Authentication: Authorization: sk_live_YOUR_KEY

    Request Body

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | upload_id | string | Required | The upload ID from POST /v1/upload. |
    | language | string | Optional | ISO 639-1 language code (e.g. "en", "es", "fr"). Omit for auto-detection. Warning: forcing a wrong language may produce a translation instead of a transcription. |
    | diarize | boolean | Optional | Enable speaker diarization to identify who said what. Default: false. |
    | speaker_count | integer | Optional | Expected number of speakers (1–50). Requires diarize to be true. Strongly recommended when you know the speaker count, because providing it noticeably improves diarization accuracy. If omitted, the model auto-detects, which can over- or under-segment speakers. |
    | align | boolean | Optional | Enable word-level timestamps with start/end times for every word, plus per-word confidence scores when the alignment model supports them. Default: true. Confidence is language-dependent: some languages return null. Note: when diarize is true, alignment is automatically enabled (required for speaker assignment), even if you pass align: false. |
    | webhook_url | string | Optional | URL to receive completion/failure webhook. HTTPS recommended. |

    Response

    json
    {
      "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "status": "created",
      "message": "Transcription created. The file will be validated and processed. Poll GET /v1/transcribe/{id} for status updates."
    }

    Code Examples

    import time

    import requests

    HEADERS = {"Authorization": "sk_live_YOUR_KEY"}

    # Submit the job using the upload_id returned by POST /v1/upload.
    resp = requests.post(
        "https://api.scriptivox.com/v1/transcribe",
        headers=HEADERS,
        json={
            "upload_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
            "diarize": True,
            "speaker_count": 2,
            "language": "en",
        },
    )
    job = resp.json()

    # Poll every 5 seconds until the job reaches a terminal status.
    while True:
        result = requests.get(
            f"https://api.scriptivox.com/v1/transcribe/{job['id']}",
            headers=HEADERS,
        ).json()
        if result["status"] in ("completed", "failed"):
            break
        time.sleep(5)

    Get result

    GET /v1/transcribe/{id}

    Get the status and result of a transcription. Poll this endpoint until status is completed or failed, or use webhooks for real-time notifications. Pass ?format=srt|vtt|text to export the transcript directly as captions or plain text instead of JSON.

    Authentication: Authorization: sk_live_YOUR_KEY

    Parameters

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | id | string | Required | The transcription ID returned from POST /v1/transcribe (passed in the URL path). |
    | format | string | Optional | Output format. One of: "json" (default, full structured response), "srt" (SubRip captions), "vtt" (WebVTT captions), "text" (plain text only). srt, vtt, and text require status=completed. Responses use Content-Type text/plain for srt and text, and text/vtt for vtt. |
    | max_words | integer | Optional | Caption segmentation: maximum words per cue. 1–50. Default 4. Only applies to format=srt, vtt, or text. Lower values give shorter, faster-changing captions; higher values give longer cues. Whichever of max_words, max_chars, or max_duration trips first ends the segment. |
    | max_chars | integer | Optional | Caption segmentation: maximum characters per cue. 10–500. Default 80. Standard subtitle convention is ~37 chars/line × 2 lines = ~80. |
    | max_duration | integer | Optional | Caption segmentation: maximum seconds per cue. 1–60. Default 10. Caps how long a single subtitle stays on screen. |
    | sentence_aware | boolean | Optional | Caption segmentation: end a cue whenever a sentence-ending punctuation mark appears (. ! ?). Default true. Produces more natural caption breaks at the cost of slightly more variable cue lengths. |
    | include_speakers | string | Optional | Speaker labels in captions. "true" prefixes every cue with the speaker label. "false" never includes labels. "auto" (default) includes labels only if the job has more than one distinct speaker. Only meaningful when the job was created with diarize=true. |
    | strip_chars | string | Optional | Characters to remove from cue text before output. Up to 32 chars. Example: strip_chars=",." removes all commas and periods. Useful for cleaner captions or fitting tighter character limits. |

    Response

    json
    {
      "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "status": "completed",
      "audio_duration_seconds": 120,
      "file_size_bytes": 1920000,
      "language": "en",
      "diarize": true,
      "speaker_count": 2,
      "align": true,
      "cost_cents": 0.5,
      "source_url": "https://example.com/podcast.mp3",
      "progress": "Transcription completed successfully.",
      "created_at": "2025-01-15T10:30:00Z",
      "started_at": "2025-01-15T10:30:01Z",
      "completed_at": "2025-01-15T10:30:45Z",
      "result": {
        "full_transcript": "Hello, thanks for joining...",
        "language": "en",
        "duration_seconds": 120,
        "speakers": ["SPEAKER 1", "SPEAKER 2"],
        "utterances": [
          {
            "start": 0.5,
            "end": 3.2,
            "text": "Hello, thanks for joining the call today.",
            "speaker": "SPEAKER 1",
            "confidence": 0.95,
            "words": [
              {
                "word": "Hello,",
                "start": 0.5,
                "end": 0.9,
                "confidence": 0.98,
                "speaker": "SPEAKER 1"
              }
            ]
          }
        ]
      }
    }

    Code Examples

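    import requests

    transcription_id = "b2c3d4e5-f6a7-8901-bcde-f12345678901"  # id returned by POST /v1/transcribe
    resp = requests.get(
        f"https://api.scriptivox.com/v1/transcribe/{transcription_id}",
        headers={"Authorization": "sk_live_YOUR_KEY"},
    )
    result = resp.json()
    if result["status"] == "completed":
        print(result["result"]["full_transcript"])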

    Status values

    The status lifecycle is created → downloading (URL flow only) → processing → completed | failed.

    | Status | Description |
    | --- | --- |
    | created | Job accepted. Download (URL flow) or validation (upload flow) is about to start. |
    | downloading | URL flow only. The file is being fetched from the provided URL. |
    | processing | The file has been validated and is being transcribed on a GPU worker. |
    | completed | Transcription finished successfully. The result object is now populated. |
    | failed | The job failed. Inspect error.code and error.message. |
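A polling loop built on this lifecycle only needs to know which statuses are terminal. The sketch below adds a client-side deadline so a stuck job cannot block the caller forever. wait_for_transcription and its fetch_job argument are hypothetical names, and the default timeout is an arbitrary choice rather than a documented value.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_transcription(fetch_job, interval=5.0, timeout=3600.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll until the job is completed or failed, or raise TimeoutError.

    fetch_job is a hypothetical zero-argument callable that returns the parsed
    JSON body of GET /v1/transcribe/{id}.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']!r} after {timeout}s")
        sleep(interval)
```

Any status other than completed or failed is treated as still in progress, so the loop keeps working if an intermediate status appears that the client did not anticipate. Injecting sleep and clock keeps the function testable without real waiting.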

    Caption / text export

    GET /v1/transcribe/{id}?format=srt|vtt|text returns the transcript in the requested format with the appropriate Content-Type:

    | format | Content-Type | Body |
    | --- | --- | --- |
    | json (default) | application/json | Full structured response with result.utterances[] |
    | srt | text/plain; charset=utf-8 | SubRip Text |
    | vtt | text/vtt; charset=utf-8 | WebVTT, with <v Speaker> voice tags when speaker labels are shown |
    | text | text/plain; charset=utf-8 | Plain text. Diarized jobs get a Speaker: prefix per turn with blank lines between turns. |

    srt, vtt, and text require status=completed; otherwise the request returns 400 INVALID_REQUEST. The JSON format works at any status.

    Segmentation controls

    The caption formats (srt, vtt, and text) accept the same segmentation knobs as the dashboard's Advanced Export modal — pass them as query params alongside format:

    | Param | Default | Range | Effect |
    | --- | --- | --- | --- |
    | max_words | 4 | 1–50 | Maximum words per cue. Lower = shorter, faster-changing captions. |
    | max_chars | 80 | 10–500 | Maximum characters per cue. ~80 is the industry convention (~37 chars × 2 lines). |
    | max_duration | 10 | 1–60 (seconds) | Maximum seconds a single cue stays on screen. |
    | sentence_aware | true | true / false | End a cue when a sentence ends (. ! ?). Produces more natural breaks. |
    | include_speakers | auto | true / false / auto | Whether to prefix each cue with the speaker label. auto includes them only when the job has more than one distinct speaker. |
    | strip_chars | (empty) | up to 32 chars | Characters to remove from cue text. E.g. strip_chars=,. drops all commas and periods. |

    Whichever of max_words, max_chars, max_duration is exceeded first ends the current cue. sentence_aware adds a sentence-ending punctuation rule on top of those.
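The rule above can be illustrated with a short sketch that splits word objects into cues. This is written for this page to show how the limits interact; it is not the server's implementation. segment_words is a hypothetical name, and details such as counting characters with single spaces between words are assumptions.

```python
def segment_words(words, max_words=4, max_chars=80, max_duration=10,
                  sentence_aware=True):
    """Group word dicts ({"word", "start", "end"}) into caption cues."""
    cues, current = [], []

    def flush():
        # Close the cue being built, if any, and start a fresh one.
        if current:
            cues.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(w["word"] for w in current),
            })
            current.clear()

    for w in words:
        if current:
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            too_many = len(current) + 1 > max_words
            too_long = text_len > max_chars
            too_slow = w["end"] - current[0]["start"] > max_duration
            # Whichever limit would be exceeded first ends the current cue.
            if too_many or too_long or too_slow:
                flush()
        current.append(w)
        if sentence_aware and w["word"].endswith((".", "!", "?")):
            flush()
    flush()
    return cues
```

With the defaults, the seven words of "Hello, thanks for joining the call today." become two cues: the first closes when a fifth word would exceed max_words, and the second closes on the sentence-ending period.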

    Examples:

    bash
    # Short cues (TikTok-style: max 3 words, 3 s each, no sentence-awareness)
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=srt&max_words=3&max_duration=3&sentence_aware=false" \
      -H "Authorization: sk_live_YOUR_KEY"

    # Standard SRT with default settings
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=srt" \
      -H "Authorization: sk_live_YOUR_KEY"

    # WebVTT, always show speaker tags, strip filler punctuation
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=vtt&include_speakers=true&strip_chars=,." \
      -H "Authorization: sk_live_YOUR_KEY"

    # Plain text without speaker prefixes
    curl "https://api.scriptivox.com/v1/transcribe/{id}?format=text&include_speakers=false" \
      -H "Authorization: sk_live_YOUR_KEY"

    When the job was made with align: false (no per-word timestamps), the segmentation falls back to utterance-level — max_words/max_chars/max_duration still cap each cue but cuts can only happen at utterance boundaries. For best segmentation control, keep align on (the default).
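When building export URLs in code rather than by hand, it helps to validate the documented ranges and let the standard library handle URL encoding. A minimal sketch, assuming a hypothetical export_url helper and that booleans are sent as lowercase true/false as in the curl examples:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.scriptivox.com/v1"
# Documented ranges for the integer segmentation parameters.
RANGES = {"max_words": (1, 50), "max_chars": (10, 500), "max_duration": (1, 60)}


def export_url(transcription_id, fmt="srt", **options):
    """Return the GET URL for a caption or text export."""
    if fmt not in ("json", "srt", "vtt", "text"):
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
        elif isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    return f"{BASE_URL}/transcribe/{transcription_id}?{urlencode(params)}"
```

urlencode percent-encodes characters such as the comma in strip_chars, which avoids surprises when the stripped characters include & or #.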


    List transcriptions

    GET /v1/transcriptions

    List your transcriptions with optional filters and cursor-based pagination. Newest first by default. Soft-deleted transcriptions are excluded.

    Authentication: Authorization: sk_live_YOUR_KEY

    Query Parameters

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | status | string | Optional | Filter by status. One of: "created", "downloading", "pending", "processing", "completed", "failed". |
    | from | string | Optional | ISO-8601 timestamp. Inclusive lower bound on created_at. |
    | to | string | Optional | ISO-8601 timestamp. Exclusive upper bound on created_at. |
    | limit | integer | Optional | 1–200. Default 50. |
    | order | string | Optional | "desc" (default, newest first) or "asc". |
    | cursor | string | Optional | Opaque cursor from the next_cursor field of a previous response. Use to fetch the next page. |

    Response

    json
    {
      "items": [
        {
          "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
          "status": "completed",
          "audio_duration_seconds": 120,
          "file_size_bytes": 1920000,
          "language": "en",
          "diarize": true,
          "speaker_count": 2,
          "align": true,
          "cost_cents": 0.4,
          "created_at": "2026-05-18T10:30:00Z",
          "started_at": "2026-05-18T10:30:01Z",
          "completed_at": "2026-05-18T10:30:45Z",
          "source_url": "https://example.com/podcast.mp3"
        }
      ],
      "next_cursor": "eyJ0IjoiMjAyNi0wNS0xOFQxMDozMDowMFoiLCJpIjoiYjJjM2Q0ZTUtZjZhNy04OTAxLWJjZGUtZjEyMzQ1Njc4OTAxIn0="
    }

    Code Examples

    import requests

    resp = requests.get(
        "https://api.scriptivox.com/v1/transcriptions",
        headers={"Authorization": "sk_live_YOUR_KEY"},
        params={"status": "completed", "limit": 20},
    )
    page = resp.json()
    for item in page["items"]:
        print(item["id"], item["status"])

    if page["next_cursor"]:
        # Fetch the next page
        resp = requests.get(
            "https://api.scriptivox.com/v1/transcriptions",
            headers={"Authorization": "sk_live_YOUR_KEY"},
            params={"cursor": page["next_cursor"]},
        )

    Each item matches the shape of GET /v1/transcribe/{id} except that the heavy result object is omitted — fetch individual transcriptions for the full transcript. Pagination is stable across new inserts: pass the next_cursor value from the response into the cursor query param to get the next page. next_cursor is null when there are no more pages.
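Walking every page follows naturally from the next_cursor rule. The generator below is a sketch: iter_transcriptions and fetch_page are hypothetical names, and resending the original filters alongside each cursor is an assumption of this sketch, since the API's handling of filters on follow-up pages is not specified here.

```python
def iter_transcriptions(fetch_page, **filters):
    """Yield every item across all pages.

    fetch_page is a hypothetical callable that takes a dict of query
    parameters and returns the parsed JSON of GET /v1/transcriptions.
    """
    params = dict(filters)
    while True:
        page = fetch_page(params)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            # next_cursor is null on the last page.
            return
        params = {**filters, "cursor": cursor}
```

Because it is a generator, callers can stop early (for example after finding one matching job) without fetching the remaining pages.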


    Cancel transcription

    POST /v1/transcribe/{id}/cancel

    Stop an in-flight transcription. Releases the reserved balance and fires the transcription.failed webhook (if configured) with error.code=CANCELLED. Idempotent — calling cancel on an already-cancelled job returns the same response.

    Authentication: Authorization: sk_live_YOUR_KEY

    Path Parameters

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | id | string | Required | The transcription ID (passed in the URL path). |

    Response

    json
    {
      "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
      "status": "failed",
      "error": {
        "code": "CANCELLED",
        "message": "Cancelled by customer"
      },
      "released_cents": 0.5
    }

    Code Examples

    import requests

    transcription_id = "b2c3d4e5-f6a7-8901-bcde-f12345678901"
    requests.post(
        f"https://api.scriptivox.com/v1/transcribe/{transcription_id}/cancel",
        headers={"Authorization": "sk_live_YOUR_KEY"})

    Cancel is allowed only while the job is in created, downloading, pending, or processing state. Cancelling a completed or failed job returns 409 CONFLICT. Cancellation is best-effort against the GPU — the model may still finish briefly after, but its result is discarded and you are not charged.


    Delete transcription

    DELETE /v1/transcribe/{id}

    Soft-delete a completed or failed transcription. Removes the stored transcript from our storage. The job record is kept for 7 days for audit, then hard-deleted. In-flight jobs cannot be deleted — cancel them first.

    Authentication: Authorization: sk_live_YOUR_KEY

    Path Parameters

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | id | string | Required | The transcription ID (passed in the URL path). |

    Response

    (204 No Content, empty body)

    Code Examples

    import requests

    transcription_id = "b2c3d4e5-f6a7-8901-bcde-f12345678901"
    requests.delete(
        f"https://api.scriptivox.com/v1/transcribe/{transcription_id}",
        headers={"Authorization": "sk_live_YOUR_KEY"})

    Returns 204 No Content on success. Idempotent — deleting an already-deleted transcription also returns 204. Trying to delete an in-flight transcription returns 409 CONFLICT with a message telling you to cancel first.
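Taken together, the cancel and delete rules mean a client that wants a job gone has to look at its status first. A small sketch of that decision, where removal_steps is a hypothetical helper and the status sets are taken from the cancel and delete descriptions on this page:

```python
IN_FLIGHT = {"created", "downloading", "pending", "processing"}
FINISHED = {"completed", "failed"}


def removal_steps(status):
    """Return the API calls needed to remove a job that is in the given status."""
    if status in IN_FLIGHT:
        # Deleting now would return 409 CONFLICT, so cancel first.
        return ["cancel", "delete"]
    if status in FINISHED:
        return ["delete"]
    raise ValueError(f"unknown status: {status!r}")
```

A cancelled job ends up in the failed state, which is why the delete call is valid as the second step.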


    Balance

    A non-zero balance is required to start an upload or submit a transcription — POST /v1/upload and POST /v1/transcribe return 402 ZERO_BALANCE when your balance is $0. The exact cost is reserved once the audio duration is known (after download/validation), not on submission.

    GET /v1/balance

    Returns your current account balance in cents, the amount reserved for in-progress transcriptions, the amount available for new jobs, and an estimate of remaining hours at the current per-hour price.

    Authentication: Authorization: sk_live_YOUR_KEY

    Response

    json
    {
      "balance_cents": 1500,
      "reserved_cents": 100,
      "available_cents": 1400,
      "price_per_hour_cents": 20,
      "estimated_hours_available": 93.3,
      "deposit_url": "https://platform.scriptivox.com/billing",
      "updated_at": "2025-01-15T10:30:00Z"
    }

    Code Examples

    import requests

    resp = requests.get(
        "https://api.scriptivox.com/v1/balance",
        headers={"Authorization": "sk_live_YOUR_KEY"},
    )
    balance = resp.json()
    print(f"Available: ${balance['available_cents'] / 100:.2f}")
    print(f"Hours remaining: {balance['estimated_hours_available']:.1f}")

    Error Codes

    All errors follow the same format:

    json
    {
      "error": {
        "code": "ERROR_CODE",
        "message": "Human-readable description"
      }
    }
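A client can turn this shape into a typed exception while staying tolerant of bodies that are not JSON at all (for example an HTML error page from an intermediate network layer). The sketch below is one possible approach; ScriptivoxError, error_from_response, and the "UNKNOWN" fallback code are hypothetical names introduced here, not part of the API.

```python
import json


class ScriptivoxError(Exception):
    """Hypothetical exception carrying the HTTP status and the API error fields."""

    def __init__(self, http_status, code, message):
        super().__init__(f"{http_status} {code}: {message}")
        self.http_status = http_status
        self.code = code
        self.message = message


def error_from_response(http_status, body_text):
    """Build a ScriptivoxError from a status code and the raw response body."""
    try:
        error = json.loads(body_text)["error"]
        return ScriptivoxError(http_status, error["code"], error["message"])
    except (ValueError, KeyError, TypeError):
        # Body is not in the documented JSON shape; keep a short excerpt instead.
        return ScriptivoxError(http_status, "UNKNOWN", body_text[:200])
```

Callers can then branch on err.code (for example ZERO_BALANCE versus RATE_LIMIT_EXCEEDED) instead of parsing message strings.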

    Synchronous errors — returned immediately on the request

    | HTTP | Code | Description |
    | --- | --- | --- |
    | 400 | INVALID_REQUEST | Malformed request body or missing required fields |
    | 400 | INVALID_FILENAME | Filename is missing, too long, contains invalid characters, or has no extension |
    | 400 | INVALID_MEDIA_FORMAT | Unsupported file extension at upload time (e.g. .txt, .pdf). Also returned async when ffprobe rejects the actual file contents. |
    | 400 | FILE_NOT_UPLOADED | File not found at the upload URL |
    | 400 | FILE_TOO_LARGE | File exceeds 5 GB limit |
    | 400 | UPLOAD_ALREADY_USED | Upload already used for a transcription |
    | 400 | UPLOAD_EXPIRED | Upload URL expired (1 hour TTL) |
    | 401 | INVALID_API_KEY | Invalid or missing API key |
    | 401 | API_KEY_REVOKED | API key has been revoked |
    | 402 | INSUFFICIENT_BALANCE | Not enough balance for this transcription |
    | 402 | ZERO_BALANCE | Balance is $0; deposit required |
    | 404 | UPLOAD_NOT_FOUND | Upload ID does not exist |
    | 404 | TRANSCRIPTION_NOT_FOUND | Transcription ID does not exist |
    | 404 | NOT_FOUND | Path doesn't match any endpoint |
    | 405 | METHOD_NOT_ALLOWED | The path exists but only accepts a different HTTP method. The response includes an Allow header naming the accepted method (e.g. Allow: GET if you POST to /v1/balance). |
    | 409 | CONFLICT | Action not allowed in the current state (e.g. delete on an in-flight job, cancel on a completed one) |
    | 409 | IDEMPOTENCY_KEY_LOCKED | Another request with the same Idempotency-Key is mid-flight. Retry after a few seconds (Retry-After header sent). |
    | 415 | UNSUPPORTED_MEDIA_TYPE | Request body sent without Content-Type: application/json on a POST/PUT/PATCH. The response includes an Accept-Post: application/json header. Parameters are allowed (application/json; charset=utf-8 works). |
    | 422 | IDEMPOTENCY_KEY_CONFLICT | Idempotency-Key reused with a different request body (see Idempotency) |
    | 429 | RATE_LIMIT_EXCEEDED | Too many requests (see Rate Limits) |
    | 500 | INTERNAL_ERROR | Server error |

    Asynchronous errors — surface only on GET /v1/transcribe/{id}

    POST /v1/transcribe accepts the job and returns {"status":"created"} even if the input will ultimately fail. The errors below only appear on the GET endpoint, with the top-level status set to "failed" and the failure reason in the error object. Your client must handle these on the poll path, not on submit.

    | Code | When it appears |
    | --- | --- |
    | URL_NOT_ACCESSIBLE | URL flow only. The URL returned 4xx/5xx, didn't resolve, or the connection was refused. |
    | DOWNLOAD_FAILED | URL flow only. The download started but was interrupted. |
    | INVALID_MEDIA_FORMAT | The file downloaded but ffprobe rejected it: not actually audio/video, no audio track, or shorter than the 1-second minimum (message names the measured duration, e.g. "Audio too short (0.10s). Minimum duration is 1 second."). |
    | DURATION_TOO_LONG | Audio exceeds the 10-hour limit (only known after probing duration). |
    | PROCESSING_ERROR | The GPU job failed after all retries. |
    | CREATED_TIMEOUT | Job sat in created for more than 30 minutes; the validation step never started. |
    | DOWNLOAD_TIMEOUT | Job sat in downloading for more than 15 minutes; the file download stalled. |
    | PROCESSING_TIMEOUT | Job sat in processing for more than 45 minutes; the GPU never returned a result. |
    | BILLING_ERROR | Internal accounting issue while finalizing the charge. Your balance was not debited and the transcript was not delivered. Safe to retry. |
    | INTERNAL_ERROR | A server-side step (e.g. queueing the job to our processing layer) failed after we accepted the request. Safe to retry. |
    | CANCELLED | The transcription was cancelled by the customer via POST /v1/transcribe/{id}/cancel. The reserved balance was released; you are not charged. |

    Failed transcriptions are free. The reserved balance is released back to your account, so a failure costs $0 regardless of how far the job got.

    error.message is always a sanitized, customer-safe string — we never forward raw GPU/library stack traces or internal paths. If you need more detail than the message provides for a PROCESSING_ERROR, contact support with the transcription ID and we can look up the raw error on our side.
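On the poll path, a client typically wants to know whether a failed job is worth resubmitting unchanged. The sketch below treats only the codes explicitly described as safe to retry as retryable; should_resubmit is a hypothetical helper, and whether to also retry timeouts or download failures is a policy choice left to the caller.

```python
# Codes the asynchronous error table describes as "Safe to retry".
RETRYABLE_CODES = {"BILLING_ERROR", "INTERNAL_ERROR"}


def should_resubmit(job):
    """Return True when a failed job's error code is documented as safe to retry."""
    if job.get("status") != "failed":
        return False
    error = job.get("error") or {}
    return error.get("code") in RETRYABLE_CODES
```

Errors such as INVALID_MEDIA_FORMAT or DURATION_TOO_LONG describe the input itself, so resubmitting the same file would be expected to fail the same way.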


    Important notes

    Language parameter behavior

    Recommended: always pass language explicitly. Auto-detection works in most cases but has a small failure rate that's fully avoidable. Passing the actual language gives more accurate transcripts, faster turnaround (the model skips its detection pass), and protects you from the known edge cases below.

    When you specify a language code, the model is forced to interpret the audio as that language. If the audio is actually in a different language, the model may translate rather than transcribe — for example, setting language: "es" on English audio can produce a Spanish translation of the speech. Omit language (or pass null) to let the model auto-detect.

    Auto-detect picks a single dominant language for the whole file and is not perfect:

    • Code-switched audio (e.g. English/Spanish in the same clip) typically gets transcribed as the dominant language, and segments in the other language may be dropped or mistranscribed.
    • Hindi audio is sometimes routed to Urdu by the detector. If you know the language in advance, pass it explicitly.
    • Short clips (under 30s) give the detector less signal and are more likely to mis-route.
    • Music or background noise during the first few seconds can throw the detector off.

    If you know the language up front — even probabilistically — pass it. The accuracy cost of being wrong is roughly the same as the cost of auto-detect picking the wrong language, but passing it is faster and works on the edge cases above.

    Silence and very short clips

    Whisper-based models are prone to a known hallucination on near-silent or sub-second audio, often producing a phantom "Thank you." or similar filler. If your pipeline can produce silent or extremely short clips, filter them on your side before submitting.
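One way to filter such clips before submitting is to measure duration and peak level locally. The sketch below handles only 16-bit PCM WAV data using the Python standard library. The function names are hypothetical and the min_peak threshold of 0.01 is an arbitrary illustrative value, not a recommendation from the API.

```python
import array
import io
import sys
import wave


def wav_stats(data):
    """Return (duration_seconds, peak) for 16-bit PCM WAV bytes, with peak in 0..1."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frame_count = wav.getnframes()
        duration = frame_count / wav.getframerate()
        samples = array.array("h", wav.readframes(frame_count))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV samples are stored little-endian
    peak = max((abs(s) for s in samples), default=0) / 32768
    return duration, peak


def worth_submitting(data, min_seconds=1.0, min_peak=0.01):
    """Skip clips under the 1-second minimum or with almost no signal."""
    duration, peak = wav_stats(data)
    return duration >= min_seconds and peak >= min_peak
```

Peak level is a crude proxy for silence; a pipeline with noisy recordings may prefer an RMS or voice-activity check instead.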

    Size, duration, and retention limits

    • Max file size: 5 GB. Anything larger is rejected with 400 FILE_TOO_LARGE. Very large uploads can also be rejected by the network layer with an HTML 413 page before they reach the API — your client should handle non-JSON error bodies gracefully.
    • Max duration: 10 hours. Files probing longer than this fail with DURATION_TOO_LONG (asynchronous; you'll see it on the GET endpoint after probing completes).
    • Min duration: 1 second. Files shorter than 1 second fail asynchronously with INVALID_MEDIA_FORMAT and a message naming the actual measured duration. (Sub-second clips have too little speech signal to transcribe reliably.)
    • Request body Content-Type: every POST / PUT / PATCH that carries a JSON body must send Content-Type: application/json (parameters allowed, e.g. application/json; charset=utf-8). Anything else is rejected at the gateway with 415 UNSUPPORTED_MEDIA_TYPE and an Accept-Post: application/json header.
    • Presigned URL TTL: 1 hour. Uploads must PUT to the URL within expires_in seconds of receiving it; after that the URL returns 400 UPLOAD_EXPIRED.
    • Audio retention: uploaded source audio is retained for 24 hours after the job ends, then deleted. Transcripts themselves remain available via GET /v1/transcribe/{id}.

    Filename rules

    The filename you pass to POST /v1/upload must satisfy all of the following:

    • ASCII only (rename files with accents, CJK, or emoji before upload — Supabase Storage rejects non-ASCII object keys).
    • No path separators — / and \ are rejected.
    • No < > " ' ` characters. These are rejected to prevent injection into dashboards that render filenames unescaped.
    • Must include a supported extension (e.g. .mp3, .wav, .mp4). See Supported Formats for the full list.
    • 255 characters or fewer.

    Violations return 400 INVALID_FILENAME (or 400 INVALID_MEDIA_FORMAT if the extension is present but unsupported).
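These rules are simple enough to check before calling POST /v1/upload. A minimal sketch follows. filename_problem is a hypothetical helper, and EXAMPLE_EXTENSIONS holds only the three extensions named above as an illustration; substitute the full list from Supported Formats.

```python
import os

FORBIDDEN_CHARS = set('/\\<>"\'`')
# Illustrative subset only; the full supported list is not reproduced here.
EXAMPLE_EXTENSIONS = {".mp3", ".wav", ".mp4"}


def filename_problem(name, extensions=EXAMPLE_EXTENSIONS):
    """Return the error code the rules above imply, or None if the name passes."""
    if not name or len(name) > 255 or not name.isascii():
        return "INVALID_FILENAME"
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return "INVALID_FILENAME"
    ext = os.path.splitext(name)[1].lower()
    if ext in ("", "."):
        return "INVALID_FILENAME"
    if ext not in extensions:
        return "INVALID_MEDIA_FORMAT"
    return None
```

Lower-casing the extension before the lookup is a choice made for this sketch; whether the API treats extensions case-insensitively is not stated here.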

    audio_duration_seconds is an integer

    audio_duration_seconds in GET /v1/transcribe/{id} is rounded to the nearest whole second (so an 11.4-second clip returns 11). cost_cents, by contrast, is fractional and exact (e.g. 0.061111 for an 11-second clip at $0.20/hour): your account balance is debited against the precise value, not a rounded one.
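
    The example figure can be reproduced with simple arithmetic, assuming cost scales linearly with duration at the quoted $0.20/hour (20 cents per 3600 seconds) and that no minimum charge applies. The function name estimate_cost_cents is illustrative only.

```python
RATE_CENTS_PER_HOUR = 20.0  # $0.20/hour, as quoted above


def estimate_cost_cents(duration_seconds):
    """Estimated cost in cents for a clip, with no rounding applied."""
    return duration_seconds * RATE_CENTS_PER_HOUR / 3600.0


print(round(estimate_cost_cents(11), 6))  # prints 0.061111
```

    Treat the result as an estimate for budgeting; the cost_cents value returned by the API is the authoritative figure.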

    speaker_count is a soft prior, not a hard ceiling

    When you set speaker_count: N with diarize: true, you're giving the diarizer a hint about how many speakers to expect — not a hard cap. The result may include slightly more or fewer speaker labels than N (e.g. requesting 5 may produce 6). Values outside 1–50, or speaker_count without diarize: true, are rejected with 400 INVALID_REQUEST.

    Recommended: whenever you know how many speakers are on the recording, pass speaker_count. It noticeably improves diarization accuracy — the model uses it as a prior instead of guessing, which cuts down on over-segmentation (one speaker split across two labels) and under-segmentation (two speakers merged into one).
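
    A client can catch the two documented rejection cases before sending the request. This is a sketch: check_speaker_options is a hypothetical helper, and the returned messages are not the API's actual error text.

```python
def check_speaker_options(diarize=False, speaker_count=None):
    """Return None if the options look valid, else a reason for a likely INVALID_REQUEST."""
    if speaker_count is None:
        return None  # omitting speaker_count is always allowed
    if not diarize:
        return "speaker_count requires diarize: true"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(speaker_count, bool) or not isinstance(speaker_count, int):
        return "speaker_count must be an integer"
    if not 1 <= speaker_count <= 50:
        return "speaker_count must be between 1 and 50"
    return None
```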

    Per-word leading whitespace

    Word objects in align: true output have whitespace pre-stripped (e.g. {"word":"The"}, not {"word":" The"}). Reconstruct sentence text from utterance.text if you need exact spacing.

    diarize forces align

    diarize requires word-level alignment to map speakers onto each word, so any request with diarize: true is processed as if align: true — even when the caller explicitly passes align: false. The stored job and webhook payload reflect the effective value (align: true), and word objects are present on the result. If you don't want alignment data, leave diarize off.

    Idempotency

    Both POST /v1/upload and POST /v1/transcribe accept an optional Idempotency-Key HTTP header. When present, we cache the response for 24 hours and replay it byte-for-byte on subsequent requests with the same key and body. This protects you against double-billed transcriptions if your network drops the response and your client retries.

    bash
    curl -X POST https://api.scriptivox.com/v1/transcribe \
      -H "Authorization: sk_live_YOUR_KEY" \
      -H "Idempotency-Key: 7c8f5b3a-1234-4d56-90ab-cdef01234567" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com/audio.mp3"}'

    Rules:

    • Key is opaque — we don't parse it. Generate one per logical operation (a UUID is conventional). Reuse the same key when retrying after a failure.
    • 1–255 printable ASCII characters.
    • Same key + same body within 24h → cached response is replayed. The replayed response carries an Idempotent-Replay: true header so you can tell it was cached.
    • Same key + different body → 422 IDEMPOTENCY_KEY_CONFLICT. This is a bug in your retry logic — use a fresh key for a different operation.
    • Same key while a previous request is still in progress → 409 IDEMPOTENCY_KEY_LOCKED with a Retry-After header. Sleep briefly and retry; the cached response will be available once the first request finishes.
    • No header sent → endpoint behaves normally (no caching).

    The replay cache covers 200 success responses only. 4xx and 5xx errors are not cached, so you can retry freely.
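
    The retry rules above could be wrapped in a small helper like the following. It is a sketch under stated assumptions: send is a hypothetical transport callable that you supply (for example a thin wrapper around your HTTP library), the exponential backoff on network errors is an arbitrary choice, and Retry-After is assumed to be a number of seconds.

```python
import time
import uuid


def post_with_idempotency(send, body, max_attempts=5, sleep=time.sleep):
    """Retry one logical POST under a single Idempotency-Key.

    `send(headers, body)` is a hypothetical callable returning
    (status, response_headers, response_body) or raising OSError
    when the network drops the request or the response.
    """
    key = str(uuid.uuid4())  # one key per logical operation, reused on every retry
    headers = {"Idempotency-Key": key, "Content-Type": "application/json"}
    for attempt in range(max_attempts):
        try:
            status, resp_headers, resp_body = send(headers, body)
        except OSError:
            sleep(2 ** attempt)  # assumed backoff; the first request may still have succeeded
            continue
        if status == 409:  # treated here as IDEMPOTENCY_KEY_LOCKED: first request still running
            sleep(float(resp_headers.get("Retry-After", 1)))
            continue
        return status, resp_headers, resp_body
    raise RuntimeError("no final response after %d attempts" % max_attempts)
```

    A 422 IDEMPOTENCY_KEY_CONFLICT is returned to the caller unchanged, since retrying it with the same key and a different body cannot succeed.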

    Unknown fields are rejected

    Request bodies on POST /v1/upload and POST /v1/transcribe are strict: any top-level field that isn't in the documented parameter list returns 400 INVALID_REQUEST with the offending key name. This is to surface typos ({"dirize": true}) immediately instead of silently ignoring them. Nested objects (e.g. inside a future metadata field) are not currently inspected.
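
    To catch such typos before the request leaves your client, a check like this may help. TRANSCRIBE_FIELDS lists only the parameters mentioned on this page and is likely incomplete; unknown_fields is a hypothetical helper.

```python
# Partial list based on the parameters mentioned on this page (an assumption);
# extend it with the full documented parameter list for the endpoint you call.
TRANSCRIBE_FIELDS = {"url", "language", "diarize", "speaker_count", "align"}


def unknown_fields(body, allowed):
    """Return the top-level keys of `body` that are not in `allowed`, sorted."""
    return sorted(key for key in body if key not in allowed)
```

    For example, unknown_fields({"url": "https://example.com/audio.mp3", "dirize": True}, TRANSCRIBE_FIELDS) returns ["dirize"].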

    Response shape changes with diarize and align

    The GET /v1/transcribe/{id} example above shows the response with diarize: true and align: true (the maximal case). When you turn flags off, some result fields become null or empty:

    | Field | diarize: false | align: false |
    | --- | --- | --- |
    | result.speakers | null | unchanged |
    | result.utterances[].speaker | null | unchanged |
    | result.utterances[].words | unchanged | [] (empty array) |
    | result.utterances[].words[].speaker | key absent | n/a (no words) |

    Top-level metadata (status, cost_cents, audio_duration_seconds, etc.) is identical regardless of flags. result.utterances[].confidence and result.utterances[].words[].confidence may be null — confidence scores depend on the alignment model used for the detected language, and not every language is covered. Always handle null defensively (e.g. skip filtering by confidence rather than dropping the word).

    Speaker labels are "SPEAKER 1", "SPEAKER 2", … (space, 1-indexed) — not "SPEAKER_00". source_url is only present on URL-flow transcriptions; upload-flow jobs omit it.
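
    A defensive reader for the result object might look like this. It is a sketch: low_confidence_words and the 0.5 threshold are assumptions, while the field names follow the response shape described above.

```python
def low_confidence_words(result, threshold=0.5):
    """Collect words whose confidence is known and below `threshold`.

    Words with a null confidence are left alone rather than treated as 0,
    and a missing, null, or empty `words` list is tolerated.
    """
    flagged = []
    for utterance in result.get("utterances") or []:
        for word in utterance.get("words") or []:
            confidence = word.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append(word["word"])
    return flagged
```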


    Rate Limits

    Limits are enforced per API key per endpoint, plus a per-IP cap across all endpoints. Exceeding any limit returns 429 RATE_LIMIT_EXCEEDED with a Retry-After header.

    | Scope | Limit | Notes |
    | --- | --- | --- |
    | POST /v1/upload (per key) | 60/min | Presigned URL generation |
    | POST /v1/transcribe (per key) | 60/min | Job submission |
    | GET /v1/transcribe/{id} (per key) | 200/min | Higher limit for polling |
    | GET /v1/transcriptions (per key) | 60/min | List your jobs |
    | POST /v1/transcribe/{id}/cancel (per key) | 30/min | Cancel in-flight |
    | DELETE /v1/transcribe/{id} (per key) | 30/min | Soft-delete |
    | GET /v1/balance (per key) | 100/min | Balance checks |
    | Per source IP (across all endpoints) | 300/min | Edge-level cap to prevent abuse |

    Limits use a sliding 60-second window, not a fixed-window counter: a short burst at a rate above the per-minute average is tolerated as long as the rolling 60-second total stays under the limit. Plan against the steady-state number, not the burst.
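
    If you prefer to stay under the limit rather than react to 429 responses, a client-side sliding-window throttle is one option. The class below is a generic sketch and not the server's implementation; SlidingWindowLimiter and its method names are hypothetical.

```python
import collections
import time


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any rolling `window` seconds."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.calls = collections.deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()  # drop calls that have left the window
        if len(self.calls) < self.limit:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self):
        """Note that a call was just made."""
        self.calls.append(self.clock())
```

    A polling loop would call wait_time(), sleep for that long if it is non-zero, issue the request, and then call record().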

    Rate limit headers

    Every response includes:

    | Header | Description |
    | --- | --- |
    | X-RateLimit-Limit | Maximum requests allowed per minute for this endpoint. |
    | X-RateLimit-Remaining | Requests remaining in the current rolling window. |
    | X-RateLimit-Reset | Unix timestamp when the window fully resets. |
    | Retry-After | Seconds to wait before retrying. Only present on 429 responses. |
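
    These headers can drive a simple pause-before-retry decision. The sketch below assumes headers arrive as a plain dict of strings and that Retry-After is expressed in seconds as described above; seconds_until_retry is a hypothetical helper.

```python
def seconds_until_retry(status, headers, now):
    """Return how long to pause, based on the rate-limit headers of a response.

    `headers` is a dict of response headers; `now` is the current Unix time.
    """
    h = {name.lower(): value for name, value in headers.items()}  # names are case-insensitive
    if status == 429 and "retry-after" in h:
        return max(0.0, float(h["retry-after"]))
    remaining = int(h.get("x-ratelimit-remaining", 1))
    if remaining == 0 and "x-ratelimit-reset" in h:
        # Budget exhausted: wait until the window has fully reset.
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    return 0.0
```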

    Supported Formats

    25 container/codec combinations are accepted (10 audio + 15 video). Maximum file size is 5 GB; maximum duration is 10 hours. Unrecognized extensions return 400 INVALID_MEDIA_FORMAT.

    Audio (10)

    | Extension | Format |
    | --- | --- |
    | .mp3 | MPEG Audio |
    | .wav | Waveform Audio |
    | .m4a | MPEG-4 Audio |
    | .aac | Advanced Audio Coding |
    | .ogg | Ogg Vorbis |
    | .flac | Free Lossless Audio |
    | .opus | Opus |
    | .wma | Windows Media Audio |
    | .aiff | Audio Interchange |
    | .caf | Core Audio Format |

    Video (15)

    | Extension | Format |
    | --- | --- |
    | .mp4 | MPEG-4 Video |
    | .mov | QuickTime |
    | .avi | Audio Video Interleave |
    | .mkv | Matroska Video |
    | .webm | WebM |
    | .wmv | Windows Media Video |
    | .flv | Flash Video |
    | .m4v | MPEG-4 Video (iTunes) |
    | .3gp | 3GPP |
    | .mpeg | MPEG Video |
    | .mts | AVCHD |
    | .ogv | Ogg Video |
    | .ts | MPEG Transport Stream |
    | .vob | DVD Video Object |
    | .f4v | Flash MP4 Video |

    Supported Languages

    119 languages are supported. Pass the code listed below in the language parameter: the ISO 639-1 code where one exists, otherwise a BCP-47 fallback (e.g. yue, kea).

    We recommend always passing a language explicitly. Omitting the parameter (or passing null) triggers auto-detection, which works for most inputs but has a small chance of picking the wrong language — especially on short clips, code-switched audio, or files that start with music or background noise. See Language parameter behavior for details.

    Invalid codes return 400 INVALID_REQUEST.

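    | Language | Code |
    | --- | --- |
    | Afrikaans | af |
    | Albanian | sq |
    | Amharic | am |
    | Arabic | ar |
    | Armenian | hy |
    | Assamese | as |
    | Asturian | ast |
    | Azerbaijani | az |
    | Bashkir | ba |
    | Basque | eu |
    | Belarusian | be |
    | Bengali | bn |
    | Bosnian | bs |
    | Breton | br |
    | Bulgarian | bg |
    | Cantonese | yue |
    | Cape Verdean Creole | kea |
    | Catalan | ca |
    | Cebuano | ceb |
    | Chichewa | ny |
    | Chinese | zh |
    | Croatian | hr |
    | Czech | cs |
    | Danish | da |
    | Dutch | nl |
    | English | en |
    | Estonian | et |
    | Faroese | fo |
    | Finnish | fi |
    | French | fr |
    | Fula | ff |
    | Galician | gl |
    | Georgian | ka |
    | German | de |
    | Greek | el |
    | Gujarati | gu |
    | Haitian Creole | ht |
    | Hausa | ha |
    | Hawaiian | haw |
    | Hebrew | he |
    | Hindi | hi |
    | Hungarian | hu |
    | Icelandic | is |
    | Igbo | ig |
    | Indonesian | id |
    | Irish | ga |
    | Italian | it |
    | Japanese | ja |
    | Javanese | jw |
    | Kamba | kam |
    | Kannada | kn |
    | Kazakh | kk |
    | Khmer | km |
    | Korean | ko |
    | Kyrgyz | ky |
    | Lao | lo |
    | Latin | la |
    | Latvian | lv |
    | Lingala | ln |
    | Lithuanian | lt |
    | Luganda | lg |
    | Luo | luo |
    | Luxembourgish | lb |
    | Macedonian | mk |
    | Malagasy | mg |
    | Malay | ms |
    | Malayalam | ml |
    | Maltese | mt |
    | Maori | mi |
    | Marathi | mr |
    | Mongolian | mn |
    | Myanmar | my |
    | Nepali | ne |
    | Northern Sotho | nso |
    | Norwegian | no |
    | Nynorsk | nn |
    | Occitan | oc |
    | Odia | or |
    | Oromo | om |
    | Pashto | ps |
    | Persian | fa |
    | Polish | pl |
    | Portuguese | pt |
    | Punjabi | pa |
    | Romanian | ro |
    | Russian | ru |
    | Sanskrit | sa |
    | Serbian | sr |
    | Shona | sn |
    | Sindhi | sd |
    | Sinhala | si |
    | Slovak | sk |
    | Slovenian | sl |
    | Somali | so |
    | Sorani Kurdish | ckb |
    | Spanish | es |
    | Sundanese | su |
    | Swahili | sw |
    | Swedish | sv |
    | Tagalog | tl |
    | Tajik | tg |
    | Tamil | ta |
    | Tatar | tt |
    | Telugu | te |
    | Thai | th |
    | Tibetan | bo |
    | Turkish | tr |
    | Turkmen | tk |
    | Ukrainian | uk |
    | Umbundu | umb |
    | Urdu | ur |
    | Uzbek | uz |
    | Vietnamese | vi |
    | Welsh | cy |
    | Wolof | wo |
    | Xhosa | xh |
    | Yiddish | yi |
    | Yoruba | yo |
    | Zulu | zu |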
