Quickstart

Get your first transcription running in under 5 minutes.

Create an API key

Go to the API Keys page in your dashboard and create a new key. Copy it — you'll only see it once.

Your key looks like: sk_live_12ab34cd...

Add balance

Transcription costs $0.20/hour of audio, billed per second. Add funds on the Billing page. Minimum deposit is $5.00 (~25 hours of audio).
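
The arithmetic behind those figures can be sketched as follows. This is a minimal estimate, assuming billing is strictly linear per second; the function names and the rounding to four decimal places are illustrative assumptions, not the service's documented billing logic.

```python
from decimal import Decimal

# Figures taken from the pricing text above.
PRICE_PER_HOUR_USD = Decimal("0.20")
MIN_DEPOSIT_USD = Decimal("5.00")


def estimate_cost_cents(audio_seconds: int) -> Decimal:
    """Estimate the charge in cents, assuming linear per-second billing."""
    per_second_usd = PRICE_PER_HOUR_USD / Decimal(3600)
    # Rounding to 4 decimal places is an assumption made for display only.
    return (per_second_usd * audio_seconds * 100).quantize(Decimal("0.0001"))


def hours_covered(deposit_usd: Decimal) -> Decimal:
    """Hours of audio that a deposit pays for at the listed rate."""
    return deposit_usd / PRICE_PER_HOUR_USD


print(estimate_cost_cents(120))        # a two-minute clip
print(hours_covered(MIN_DEPOSIT_USD))  # the minimum deposit
```

For a 120-second clip this estimate gives 0.6667 cents, which agrees with the cost_cents value in the sample response later on this page.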

Transcribe

Send the URL of an audio file and we'll download, validate, and transcribe the file. The endpoint accepts direct file links as well as Google Drive, Dropbox, and OneDrive sharing links.

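import requests

# Placeholders: substitute the base URL from the API Reference and your own key.
BASE = "https://api.example.com"
API_KEY = "sk_live_..."

# Submit a transcription job for a publicly reachable audio URL.
resp = requests.post(f"{BASE}/transcribe",
    headers={"Authorization": API_KEY},
    json={
        "url": "https://example.com/podcast.mp3",
        "diarize": True
    })
resp.raise_for_status()  # stop early on an HTTP error response
job = resp.json()
print(f"Transcription ID: {job['id']}")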

You'll get back a transcription ID immediately. The file downloads and processes in the background.

To label speakers, set diarize: true. If you know how many speakers are on the recording, also pass speaker_count; this noticeably improves accuracy compared with letting the model auto-detect the count. To receive a completion notification, set webhook_url. Word-level timestamps (align) are on by default; pass align: false to opt out. When diarize is true, alignment is always enabled regardless of the align value you pass, because speaker assignment requires it.

Pass language when you know it. If you omit it, the model auto-detects, which usually works but can misclassify short clips, code-switched audio, or files that start with music. Passing the ISO code (e.g. "language": "en") is both faster and more accurate. See the language parameter notes for details.
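
One way to keep these options consistent is to build the request body in a single place. The sketch below uses only the parameter names mentioned above (url, language, diarize, speaker_count, align, webhook_url). The helper build_transcribe_payload is hypothetical, and the client-side check on speaker_count is an assumption rather than documented server behavior.

```python
from typing import Any, Optional


def build_transcribe_payload(
    url: str,
    language: Optional[str] = None,
    diarize: bool = False,
    speaker_count: Optional[int] = None,
    align: bool = True,
    webhook_url: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a JSON body for POST /transcribe (hypothetical helper)."""
    if speaker_count is not None and not diarize:
        # Assumption: speaker_count is only meaningful together with diarize.
        raise ValueError("speaker_count requires diarize=True")

    payload: dict[str, Any] = {"url": url}
    if language is not None:
        payload["language"] = language  # skips auto-detection
    if diarize:
        payload["diarize"] = True
        if speaker_count is not None:
            payload["speaker_count"] = speaker_count
        # align is omitted here: alignment stays on whenever diarize is true.
    elif not align:
        payload["align"] = False  # only sent to opt out of the default
    if webhook_url is not None:
        payload["webhook_url"] = webhook_url
    return payload


payload = build_transcribe_payload(
    "https://example.com/podcast.mp3",
    language="en",
    diarize=True,
    speaker_count=2,
    align=False,  # ignored because diarize is on
)
print(payload)
```

The resulting dictionary can be passed as the json argument of the requests.post call shown earlier.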

Get the result

Poll until the status is completed or failed. Typical transcriptions complete in under a minute.

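import time

# Continues from the snippet above (reuses BASE, API_KEY, and job).
while True:
    resp = requests.get(f"{BASE}/transcribe/{job['id']}",
        headers={"Authorization": API_KEY})
    resp.raise_for_status()  # stop on an HTTP error response
    result = resp.json()

    if result["status"] == "completed":
        print(result["result"]["full_transcript"])
        break
    elif result["status"] == "failed":
        print("Error:", result["error"])
        break

    time.sleep(5)  # not finished yet; wait before polling again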
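
The loop above runs until the job reaches a terminal status. For unattended scripts, a bounded variant may be safer. The sketch below is one possible shape, not part of the API: wait_for_transcription and its parameters are hypothetical names, and the 600-second default timeout is an arbitrary assumption. The status fetch is passed in as a function so the helper stays independent of any HTTP client.

```python
import time
from typing import Any, Callable


def wait_for_transcription(
    fetch_status: Callable[[], dict[str, Any]],
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Poll fetch_status until the job is completed or failed, or time runs out."""
    deadline = clock() + timeout
    while True:
        result = fetch_status()
        if result["status"] in ("completed", "failed"):
            return result
        # Give up if the next wait would run past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(
                f"job still {result['status']!r} after {timeout} seconds"
            )
        sleep(interval)
```

With the earlier snippet in scope, the fetch function could be written as lambda: requests.get(f"{BASE}/transcribe/{job['id']}", headers={"Authorization": API_KEY}).json(). The sleep and clock parameters exist only so the helper can be exercised without real waiting.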

The response includes the full transcript, timestamped utterances, word-level timestamps (alignment is on by default), and speaker labels when diarization is enabled:

{
  "id": "txn-456",
  "status": "completed",
  "audio_duration_seconds": 120,
  "cost_cents": 0.6667,
  "result": {
    "full_transcript": "Hello, thanks for joining the call today...",
    "language": "en",
    "duration_seconds": 120,
    "speakers": ["SPEAKER 1", "SPEAKER 2"],
    "utterances": [
      {
        "start": 0.5,
        "end": 3.2,
        "text": "Hello, thanks for joining the call today.",
        "speaker": "SPEAKER 1",
        "confidence": 0.95,
        "words": [
          { "word": "Hello,", "start": 0.5, "end": 0.9, "confidence": 0.98, "speaker": "SPEAKER 1" },
          { "word": "thanks", "start": 1.0, "end": 1.3, "confidence": 0.97, "speaker": "SPEAKER 1" }
        ]
      }
    ]
  }
}
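
A short sketch of reading this structure: the helpers below print one line per utterance and total the speaking time per speaker. The function names are illustrative, the sample is a trimmed copy of the response above, and the "UNKNOWN" fallback for a missing speaker field is an assumption about responses without diarization.

```python
from collections import defaultdict
from typing import Any


def format_utterances(response: dict[str, Any]) -> list[str]:
    """Render each utterance as '[start-end] SPEAKER: text'."""
    lines = []
    for utt in response["result"]["utterances"]:
        # Assumption: "speaker" may be absent when diarization is off.
        speaker = utt.get("speaker", "UNKNOWN")
        lines.append(f"[{utt['start']:.1f}-{utt['end']:.1f}] {speaker}: {utt['text']}")
    return lines


def talk_time_by_speaker(response: dict[str, Any]) -> dict[str, float]:
    """Sum utterance durations (end minus start, in seconds) per speaker."""
    totals: dict[str, float] = defaultdict(float)
    for utt in response["result"]["utterances"]:
        totals[utt.get("speaker", "UNKNOWN")] += utt["end"] - utt["start"]
    return dict(totals)


# Trimmed copy of the sample response shown above.
sample = {
    "id": "txn-456",
    "status": "completed",
    "result": {
        "full_transcript": "Hello, thanks for joining the call today...",
        "utterances": [
            {
                "start": 0.5,
                "end": 3.2,
                "text": "Hello, thanks for joining the call today.",
                "speaker": "SPEAKER 1",
            }
        ],
    },
}

for line in format_utterances(sample):
    print(line)
print(talk_time_by_speaker(sample))
```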

Need to upload your own files?

If you don't have a public URL, you can upload files directly using the file upload flow in the API Reference.

API Reference

Full endpoint documentation

Webhooks

Real-time completion notifications