-
Could you try building
-
Right... I wrote the CUDA code to calculate the mel spectrogram more efficiently, but I just assumed the max audio length is 30 sec, and the code preallocates some work buffers with this assumption :) My bad. I'll push a fix today, but in the meantime, @jgoer, to get it working right now you can change this line here: to just
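To make the shape of that fix concrete, here is a minimal sketch of the kind of change being described. All names here are hypothetical, not the actual whisper.cpp sources; the point is only that a work buffer sized for a fixed 30 s of 16 kHz audio should instead be sized from the actual input:

```cpp
#include <algorithm>
#include <vector>

constexpr int kSampleRate   = 16000; // Whisper models expect 16 kHz input
constexpr int kChunkSeconds = 30;    // the hard-coded 30 s assumption

// Hypothetical stand-in for the mel work-buffer allocation.
std::vector<float> alloc_mel_work_buffer(int n_samples) {
    // Before: capacity fixed at 30 s of samples, so longer audio overflowed.
    // const int capacity = kChunkSeconds * kSampleRate;

    // After: size by the actual input (never smaller than one 30 s chunk).
    const int capacity = std::max(kChunkSeconds * kSampleRate, n_samples);
    return std::vector<float>(static_cast<size_t>(capacity));
}
```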
-
So, @ggerganov, I added a PR which addresses this: #2227. Better strategies than this exist and may be employed in the future.
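As one illustration of what such a strategy could look like (purely an assumption on my part, not taken from #2227): rather than picking a capacity up front, the buffer could grow on demand:

```cpp
#include <vector>

// Hypothetical helper: grow a work buffer only when the current audio
// actually needs more room, keeping the common 30 s path allocation-free.
void ensure_capacity(std::vector<float> & buf, size_t needed) {
    if (buf.size() < needed) {
        buf.resize(needed); // reallocates only for unusually long inputs
    }
}
```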
-
Hello,
I am encountering an issue with whisper.cpp. When I try to transcribe a WAV file longer than 10 minutes (around 40 MB), Whisper gets stuck in what looks like an infinite loop, emitting "[BLANK_AUDIO]" over and over, or sometimes "– Subtitling: Le Crayon d'oreille".
This is the output I get with the command line `./main ./test.wav --model ./models/ggml-large-v3.bin --language AUTO`:
```
whisper_init_from_file_with_params_no_state: loading model from '/SWAPI/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_backend_init: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA A40, compute capability 8.6, VMM: yes
Device 1: NVIDIA A40, compute capability 8.6, VMM: yes
Device 2: NVIDIA A40, compute capability 8.6, VMM: yes
whisper_model_load: CUDA0 total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_backend_init: using CUDA backend
whisper_mel_init: n_len = 3001, n_len_org = 1, n_mel = 80
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.68 MB
whisper_init_state: compute buffer (encode) = 594.22 MB
whisper_init_state: compute buffer (cross) = 7.85 MB
whisper_init_state: compute buffer (decode) = 142.09 MB
system_info: n_threads = 4 / 96 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0
main: processing './test.wav ' (19197516 samples, 1199.8 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...
whisper_mel_init: n_len = 122984, n_len_org = 119984, n_mel = 80
whisper_full_with_state: auto-detected language: en (p = 0.361893)
[00:00:00.000 --> 00:00:02.060] [BLANK_AUDIO]
[00:00:03.060 --> 00:00:05.120] [BLANK_AUDIO]
[00:00:06.120 --> 00:00:08.180] [BLANK_AUDIO]
[00:00:09.180 --> 00:00:11.240] [BLANK_AUDIO]
[00:00:12.240 --> 00:00:14.300] [BLANK_AUDIO]
[00:00:15.300 --> 00:00:17.360] [BLANK_AUDIO]
[00:00:18.360 --> 00:00:20.420] [BLANK_AUDIO]
[00:00:21.420 --> 00:00:23.480] [BLANK_AUDIO]
[00:00:24.480 --> 00:00:26.540] [BLANK_AUDIO]
...
```
I am using whisper.cpp v1.6.2. My OS is Ubuntu 20.04, with NVIDIA driver 535.171.04 and CUDA 12.3. My GPU is an NVIDIA A40. I have tested with the models ggml-large-v2.bin and ggml-large-v3.bin, but the problem remains the same. When I disable the GPU with the --no-gpu flag, the transcription proceeds without any issue.
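For reference, the CPU-only invocation described above would look like this (same file and model path as earlier in the post):

```
./main ./test.wav --model ./models/ggml-large-v3.bin --language AUTO --no-gpu
```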
Has anyone else encountered this problem?
Thanks in advance.