April 11, 2026

How to Generate Subtitles with Whisper: A Practical Tutorial (2026)

What Whisper is, why it can generate timed subtitles, and a step-by-step guide to producing SRT, VTT, or ASS files from audio or video, including file limits, language settings, export options, and troubleshooting.

This tutorial explains what Whisper is, why it can produce subtitles, and how to go from an audio or video file to a downloadable SRT / VTT / ASS file using Pancake Subtitle Tools in the browser.

What Is Whisper?

Whisper is a speech recognition system released by OpenAI. It is implemented as a machine learning model trained on a large amount of multilingual audio and matching text. When you give it a recording of human speech, Whisper predicts the words that were said — similar in purpose to classic "speech-to-text," but designed to be robust across accents, background noise, and many languages.

You do not "install Whisper" inside your word processor; it runs as separate software (on your own machine, on a server, or in the cloud) that analyzes audio and returns a transcript. Modern integrations, including the tools below, wrap Whisper so you upload a file and receive editable text with timing, not just a single block of plain text.

Why Can Whisper Generate Subtitles?

Subtitles are text lines shown at specific times during playback. To create them automatically, a system must do two things:

Recognize speech — Turn what people say into written words (the transcript).
Align text to time — Know when each phrase starts and stops so each line can get a start time and end time (a cue).

Whisper is built for continuous audio: it processes the recording in windows of about 30 seconds and outputs not only the text but also segment-level timing (an approximate start and end for each chunk of speech). That is exactly what subtitle formats need. Tools like Audio to Subtitle and Video to Subtitle take Whisper's output and format it as SRT, WebVTT, or ASS (file types that players and editors understand), so you get ready-to-use subtitle files instead of typing and timecoding everything by hand.

In short: Whisper generates subtitles because it both transcribes speech and associates the text with timestamps along the audio timeline.
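
To make the "text plus timestamps" idea concrete, here is a minimal sketch of how timed segments could be written out as SRT. The list of dictionaries with `start`, `end`, and `text` keys follows the shape the open-source `openai-whisper` Python package returns for its segments; the sample data and the helper names `srt_timestamp` and `segments_to_srt` are invented for illustration, and nothing here is meant to describe how Pancake Subtitle Tools works internally.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text: cue number, time range, cue text, blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Hypothetical segment data, for illustration only.
segments = [
    {"start": 0.0, "end": 2.4, "text": " Welcome to the show."},
    {"start": 2.4, "end": 5.15, "text": " Today we talk about subtitles."},
]
print(segments_to_srt(segments))
```

Each cue is just a number, a `start --> end` line, and the text, which is why segment-level timing is enough to produce a usable subtitle file.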

What You Get from Whisper Subtitle Generation

When you run Whisper on your media, you typically get:

Timed lines (cues) with start and end times
Transcript text for each segment
Export as SRT, WebVTT (VTT), or ASS — the same formats covered in What is an SRT file, What is a VTT file, and What is an ASS file

Whisper handles many languages and works well with clear speech. Background noise, heavy accents, or overlapping speakers can reduce accuracy — you can always fix cues in the built-in editor before downloading.

Before You Start

Supported inputs

Audio only — Use the Audio to Subtitle tool. Common formats include MP3, WAV, M4A, AAC, FLAC, OGG, and other typical audio uploads.
Video — Use the Video to Subtitle tool. Upload MP4, WebM, MOV, MKV, and other common video formats; the audio track is sent for transcription.

Limits (important)

Maximum upload size: 100 MB per file.
Free transcription applies when the media is under 1 minute. For longer audio or video, sign in with Google so the tool can process the full length.
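
If you prepare many files, a small pre-check can save a failed upload. The sketch below encodes the two limits stated above. It assumes "100 MB" means 100 × 1024 × 1024 bytes (the site may count differently), and the function name `check_limits` and its return strings are invented for this example.

```python
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB
FREE_LIMIT_SECONDS = 60               # free tier: media under 1 minute


def check_limits(size_bytes: int, duration_seconds: float) -> str:
    """Classify a file against the upload-size and free-duration limits."""
    if size_bytes > MAX_UPLOAD_BYTES:
        return "too large"
    if duration_seconds < FREE_LIMIT_SECONDS:
        return "free"
    return "sign-in required"


print(check_limits(5_000_000, 45))    # small, short clip
print(check_limits(5_000_000, 600))   # small, ten-minute file
```

You can get `size_bytes` for a real file with `os.path.getsize(path)`; the duration has to come from your editor or media player.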

What to prepare

A reasonably clear recording (closer mic, less background noise = better subtitles).
If you know the spoken language, selecting it can help; auto-detect is fine for many files.

Step 1: Choose Audio or Video Workflow

Option A — You have an audio file (podcast, voice memo, extracted WAV, etc.)

Open Audio to Subtitle.
Upload your file (drag-and-drop or file picker).
Choose the spoken language, or leave auto-detect if you are unsure.
Click Generate subtitles and wait for processing.

Option B — You have a video file

Open Video to Subtitle.
Upload the video; the tool transcribes the audio track inside the file.
Set language or auto-detect, then Generate subtitles.

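If your video is over the 100 MB size limit, export the audio only in an editor, then use Audio to Subtitle instead. Prefer a compressed format such as MP3: uncompressed WAV at CD quality grows by roughly 10 MB per minute, so it only fits for short recordings.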
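
If you would rather extract the audio from a script than from an editor, one common option is the `ffmpeg` command-line tool. The sketch below is only an example of that route: it assumes `ffmpeg` is installed and on your PATH, the file names are placeholders, and the 64 kbit/s mono setting is an arbitrary choice for speech, not a recommendation from the tool.

```python
import shutil
import subprocess


def build_extract_command(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that keeps only the audio track."""
    return [
        "ffmpeg",
        "-i", video_path,  # input video
        "-vn",             # drop the video stream
        "-ac", "1",        # downmix to mono (usually fine for speech)
        "-b:a", "64k",     # audio bitrate; an arbitrary choice here
        audio_path,        # output format follows the file extension
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg; raises if ffmpeg is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
```

Calling `extract_audio("talk.mp4", "talk.mp3")` would then write an MP3 you can upload to Audio to Subtitle.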

Step 2: Review and Edit Cues

Whisper is strong, but no automatic transcript is perfect. After generation:

Read through timing — lines that start too early or late are easy to spot while skimming.
Fix names, jargon, and numbers Whisper may mishear.
Split or merge lines if a cue is too long for one screen or a sentence is broken awkwardly.

The tool shows cues in an editor so you can adjust text and timing before export — similar in spirit to cleaning up an SRT in a desktop app, but without leaving the page.
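
For readers curious what splitting a long cue involves, here is a small sketch of one possible approach: break the text at the word boundary closest to the middle and divide the time in proportion to the text length. The function name `split_cue` and the proportional-timing rule are assumptions made for this example; the built-in editor may behave differently, and the new boundary is only a starting point to adjust by ear.

```python
def split_cue(start: float, end: float, text: str) -> list[tuple[float, float, str]]:
    """Split one cue into two at the word boundary nearest the middle."""
    words = text.split()
    if len(words) < 2:
        return [(start, end, text)]  # nothing to split

    def imbalance(i: int) -> int:
        # Difference in character length between the two halves.
        return abs(len(" ".join(words[:i])) - len(" ".join(words[i:])))

    cut = min(range(1, len(words)), key=imbalance)
    first, second = " ".join(words[:cut]), " ".join(words[cut:])

    # Assumption: speaking time is roughly proportional to text length.
    ratio = len(first) / (len(first) + len(second))
    middle = round(start + (end - start) * ratio, 3)
    return [(start, middle, first), (middle, end, second)]


print(split_cue(10.0, 14.0, "aaaa bbbb cccc dddd"))
```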

Step 3: Pick Export Format and Download

Choose the format that matches your next step:

Format	Good for
SRT	YouTube, many editors, universal players — see SRT guide
VTT	HTML5 `<video>`, web players — see VTT guide
ASS	Styled playback, fansub workflows — see ASS guide

Download the file. It is saved in UTF-8 (the tool handles this for you), so keep that encoding if you edit the file elsewhere. You can then upload subtitles to platforms or load them in VLC; a broader walkthrough is in How to add subtitles to a video.

Step 4 (Optional): Translate or Convert

Another language, same timing — Use the Subtitle Translation Tool on your SRT or VTT file.
Format conversion only — Use the site's converters (e.g. SRT to VTT, VTT to SRT) if a platform requires a specific extension.
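
To show what an SRT to VTT conversion actually changes, here is a minimal sketch: WebVTT needs a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. The function name `srt_to_vtt` is invented for this example, and it handles only plain cues (no styling or positioning), so treat it as an illustration rather than a replacement for the site's converters.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING_LINE = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert plain SRT text to WebVTT."""
    lines = []
    for line in srt_text.strip().splitlines():
        # Only timing lines are rewritten; commas in cue text are untouched.
        lines.append(TIMING_LINE.sub(r"\1.\2 --> \3.\4", line))
    # Numeric cue indexes are kept; WebVTT accepts them as cue identifiers.
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = "1\n00:00:00,000 --> 00:00:02,400\nHello, world.\n"
print(srt_to_vtt(srt))
```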

Tips for Better Whisper Subtitles

Audio quality — Normalize volume if the recording is very quiet; avoid heavy music under dialogue when possible.

Language — If results look like the wrong language, set the language explicitly instead of auto-detect.

Long files — Stay within 100 MB; for very long content, split into parts, transcribe each, then merge the subtitle files in an editor if needed. When merging, shift each later part's timestamps by the total duration of the parts before it, and renumber the cues.

Media over 1 minute — Sign in with Google when prompted; without sign-in, free transcription covers only media under one minute.
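
For the long-files tip, merging transcripts of separate parts comes down to two steps: offsetting timestamps and renumbering. The sketch below works on cues already held as `(start, end, text)` tuples in seconds. The name `merge_parts` and the data layout are assumptions for this example, and it expects you to know the real duration of each audio part (not just the end time of its last cue).

```python
Cue = tuple[float, float, str]  # (start seconds, end seconds, text)


def merge_parts(parts: list[list[Cue]], part_durations: list[float]) -> list[tuple[int, float, float, str]]:
    """Merge per-part cues into one list with shifted times and new numbers."""
    merged = []
    offset = 0.0
    for cues, duration in zip(parts, part_durations):
        for start, end, text in cues:
            number = len(merged) + 1  # renumber continuously from 1
            merged.append((number, start + offset, end + offset, text))
        offset += duration  # the next part starts where this one ended
    return merged


# Hypothetical example: two 10-minute parts.
parts = [
    [(0.0, 2.0, "First part, line one."), (3.0, 5.0, "First part, line two.")],
    [(1.0, 2.5, "Second part, line one.")],
]
for cue in merge_parts(parts, [600.0, 600.0]):
    print(cue)
```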

Troubleshooting

Upload rejected or too large
Compress the file, shorten the clip, or export a lower-bitrate audio version under 100 MB.
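
To pick a bitrate that fits, the arithmetic is: file size in bits ≈ bitrate × duration. The sketch below turns that around to estimate the highest audio bitrate that stays under the limit. It assumes "100 MB" means 100 × 1024 × 1024 bytes and reserves 5% for container overhead; both figures, and the name `max_audio_bitrate_kbps`, are assumptions for this example.

```python
def max_audio_bitrate_kbps(
    duration_seconds: float,
    limit_bytes: int = 100 * 1024 * 1024,  # assumption: 100 MB as 100 MiB
    margin: float = 0.95,                  # assumption: 5% headroom for overhead
) -> int:
    """Estimate the highest audio bitrate (kbit/s) that fits the size limit."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    usable_bits = limit_bytes * 8 * margin
    return int(usable_bits / duration_seconds / 1000)


print(max_audio_bitrate_kbps(2 * 60 * 60))  # two-hour recording
```

For a two-hour recording this comes out to a little over 100 kbit/s, so a speech-oriented export at 64 kbit/s would fit comfortably.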

Transcript is empty or very short
Check that the file actually contains audible speech on the track you expect (some videos have silent intros or only music).

Many word errors
Improve the source audio, specify the correct language, or manually correct key terms in the editor before download.

I need burned-in subtitles
Whisper tools here produce soft subtitles (separate files). Burning text into the video requires an editor or encoder that supports subtitle burn-in — not covered by the Whisper upload flow itself.

Summary

Whisper turns speech into timestamped subtitles. With Audio to Subtitle or Video to Subtitle, you upload media (up to 100 MB), choose or auto-detect language, generate cues, edit them, then download SRT, VTT, or ASS. Clips under one minute are free; anything longer requires Google sign-in.

From there, add subtitles to your video pipeline, translate with the Subtitle Translation Tool, or convert formats as needed — your subtitles are ready for editing platforms and players that accept standard subtitle files.