-
Could you try building
-
Right... I wrote the CUDA code to calculate the mel spectrogram more efficiently, but I just assumed the max audio length is 30 sec, and the code preallocates some work buffers with this assumption :) My bad. I'll push a fix today, but in the meantime, @jgoer, to get it working right now you can change this line here: to just
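To make the shape of that fix concrete, here is a minimal sketch of the kind of change being described. All names here are hypothetical, not the actual whisper.cpp sources; the point is only that a work buffer sized for a fixed 30 s of 16 kHz audio should instead be sized from the actual input:

```cpp
#include <algorithm>
#include <vector>

constexpr int kSampleRate   = 16000; // Whisper models expect 16 kHz input
constexpr int kChunkSeconds = 30;    // the hard-coded 30 s assumption

// Hypothetical stand-in for the mel work-buffer allocation.
std::vector<float> alloc_mel_work_buffer(int n_samples) {
    // Before: capacity fixed at 30 s of samples, so longer audio overflowed.
    // const int capacity = kChunkSeconds * kSampleRate;

    // After: size by the actual input (never smaller than one 30 s chunk).
    const int capacity = std::max(kChunkSeconds * kSampleRate, n_samples);
    return std::vector<float>(static_cast<size_t>(capacity));
}
```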
-
So, @ggerganov, I added a PR which addresses this: #2227. Better strategies than this exist and may be employed in the future.
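As one illustration of what such a strategy could look like (purely an assumption on my part, not taken from #2227): rather than picking a capacity up front, the buffer could grow on demand:

```cpp
#include <vector>

// Hypothetical helper: grow a work buffer only when the current audio
// actually needs more room, keeping the common 30 s path allocation-free.
void ensure_capacity(std::vector<float> & buf, size_t needed) {
    if (buf.size() < needed) {
        buf.resize(needed); // reallocates only for unusually long inputs
    }
}
```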
-
Hello,
I am encountering an issue with whisper.cpp. When I try to transcribe a WAV file longer than 10 minutes (around 40 MB), Whisper gets stuck in what looks like an infinite loop, emitting "[BLANK_AUDIO]" over and over, or sometimes "– Subtitling: Le Crayon d'oreille".
This is the output I get with the command line `./main ./test.wav --model ./models/ggml-large-v3.bin --language AUTO`:
```
whisper_init_from_file_with_params_no_state: loading model from '/SWAPI/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_backend_init: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA A40, compute capability 8.6, VMM: yes
Device 1: NVIDIA A40, compute capability 8.6, VMM: yes
Device 2: NVIDIA A40, compute capability 8.6, VMM: yes
whisper_model_load: CUDA0 total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_backend_init: using CUDA backend
whisper_mel_init: n_len = 3001, n_len_org = 1, n_mel = 80
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.68 MB
whisper_init_state: compute buffer (encode) = 594.22 MB
whisper_init_state: compute buffer (cross) = 7.85 MB
whisper_init_state: compute buffer (decode) = 142.09 MB
system_info: n_threads = 4 / 96 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0
main: processing './test.wav ' (19197516 samples, 1199.8 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...
whisper_mel_init: n_len = 122984, n_len_org = 119984, n_mel = 80
whisper_full_with_state: auto-detected language: en (p = 0.361893)
[00:00:00.000 --> 00:00:02.060] [BLANK_AUDIO]
[00:00:03.060 --> 00:00:05.120] [BLANK_AUDIO]
[00:00:06.120 --> 00:00:08.180] [BLANK_AUDIO]
[00:00:09.180 --> 00:00:11.240] [BLANK_AUDIO]
[00:00:12.240 --> 00:00:14.300] [BLANK_AUDIO]
[00:00:15.300 --> 00:00:17.360] [BLANK_AUDIO]
[00:00:18.360 --> 00:00:20.420] [BLANK_AUDIO]
[00:00:21.420 --> 00:00:23.480] [BLANK_AUDIO]
[00:00:24.480 --> 00:00:26.540] [BLANK_AUDIO]
...
```
I am using whisper.cpp v1.6.2. My OS is Ubuntu 20.04, with NVIDIA driver 535.171.04 and CUDA 12.3. My GPU is an NVIDIA A40. I have tested with the models ggml-large-v2.bin and ggml-large-v3.bin, but the problem remains the same. When I disable the GPU with the --no-gpu flag, the transcription proceeds without any issue.
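For reference, the CPU-only invocation described above would look like this (same file and model path as earlier in the post):

```
./main ./test.wav --model ./models/ggml-large-v3.bin --language AUTO --no-gpu
```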
Has anyone else encountered this problem?
Thanks in advance.