Releases: ggml-org/whisper.cpp
v1.6.1
Minor release adding initial ffmpeg support in the examples #2133 (thx @WilliamTambellini)
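A rough sketch of how this might be used, assuming the CMake option added in #2133 is named `WHISPER_FFMPEG` (check the README for the exact flag; paths are placeholders):

```
# build with ffmpeg decoding enabled (Linux), then transcribe a non-WAV input directly
cmake -B build -DWHISPER_FFMPEG=yes
cmake --build build
./build/bin/main -m models/ggml-base.en.bin -f samples/podcast.mp3
```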
What's Changed
- ci: Update build.yml to suppress warnings about node.js versions by @tamo in #2166
- node : add flash_attn param by @pprobst in #2170
- Add support for decoding input with ffmpeg (Linux) by @WilliamTambellini in #2133
New Contributors
- @WilliamTambellini made their first contribution in #2133
Full Changelog: v1.6.0...v1.6.1
v1.6.0
Overview
- Flash Attention can now be optionally enabled for faster processing on CUDA and Metal devices (#2152, example below)
- Faster ppc64 performance (40aeeee) (not tested)
- Fix `main` slowdown bug (#2070)
Shoutout to @JohannesGaessler for contributing efficient FA CUDA kernels
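A minimal sketch of enabling it from the command line, assuming the `-fa`/`--flash-attn` flag added for this release (file names are placeholders):

```
# default run vs. run with Flash Attention enabled
./main -m models/ggml-base.en.bin -f samples/jfk.wav
./main -m models/ggml-base.en.bin -f samples/jfk.wav -fa
```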
Some performance numbers for this release:
M1 Pro
CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
M1 Pro | METAL | tiny | 1 | 0 | 39.21 | 1.74 | 0.61 | 0.04 | 22c96b4 |
M1 Pro | METAL | base | 1 | 0 | 70.76 | 2.60 | 0.93 | 0.06 | 22c96b4 |
M1 Pro | METAL | small | 1 | 0 | 217.28 | 6.42 | 2.14 | 0.17 | 22c96b4 |
M1 Pro | METAL | medium | 1 | 0 | 596.74 | 14.43 | 4.75 | 0.45 | 22c96b4 |
CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
M1 Pro | METAL | tiny | 1 | 1 | 30.77 | 1.59 | 0.54 | 0.03 | 22c96b4 |
M1 Pro | METAL | base | 1 | 1 | 60.42 | 2.29 | 0.81 | 0.05 | 22c96b4 |
M1 Pro | METAL | small | 1 | 1 | 183.82 | 5.12 | 1.81 | 0.14 | 22c96b4 |
M1 Pro | METAL | medium | 1 | 1 | 517.92 | 11.60 | 4.01 | 0.38 | 22c96b4 |
M2 Ultra
CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
M2 ULTRA | METAL | tiny | 1 | 0 | 12.32 | 1.35 | 0.49 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | tiny-q5_0 | 1 | 0 | 11.65 | 1.30 | 0.51 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | tiny-q5_1 | 1 | 0 | 12.08 | 1.30 | 0.51 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | base | 1 | 0 | 17.58 | 1.90 | 0.76 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | base-q5_0 | 1 | 0 | 18.89 | 1.86 | 0.79 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | base-q5_1 | 1 | 0 | 20.69 | 1.88 | 0.79 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | small | 1 | 0 | 49.32 | 3.85 | 1.71 | 0.05 | 22c96b4 |
M2 ULTRA | METAL | small-q5_0 | 1 | 0 | 54.91 | 3.81 | 1.82 | 0.06 | 22c96b4 |
M2 ULTRA | METAL | small-q5_1 | 1 | 0 | 54.92 | 3.81 | 1.79 | 0.06 | 22c96b4 |
M2 ULTRA | METAL | medium | 1 | 0 | 134.34 | 8.04 | 3.82 | 0.13 | 22c96b4 |
M2 ULTRA | METAL | medium-q5_0 | 1 | 0 | 151.68 | 7.59 | 4.07 | 0.14 | 22c96b4 |
M2 ULTRA | METAL | medium-q5_1 | 1 | 0 | 151.58 | 7.67 | 4.07 | 0.14 | 22c96b4 |
M2 ULTRA | METAL | medium-dis | 1 | 0 | 120.82 | 1.07 | 0.41 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | large-v2 | 1 | 0 | 235.63 | 12.27 | 5.85 | 0.22 | 22c96b4 |
M2 ULTRA | METAL | large-v2-q5_0 | 1 | 0 | 273.38 | 11.17 | 6.40 | 0.26 | 22c96b4 |
M2 ULTRA | METAL | large-v2-q5_1 | 1 | 0 | 272.44 | 11.32 | 6.29 | 0.26 | 22c96b4 |
M2 ULTRA | METAL | large-v2-dis | 1 | 0 | 212.51 | 1.20 | 0.47 | 0.02 | 22c96b4 |
CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
M2 ULTRA | METAL | tiny | 1 | 1 | 9.07 | 1.33 | 0.45 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | tiny-q5_0 | 1 | 1 | 9.74 | 1.33 | 0.47 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | tiny-q5_1 | 1 | 1 | 8.93 | 1.31 | 0.46 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | base | 1 | 1 | 15.75 | 1.87 | 0.71 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | base-q5_0 | 1 | 1 | 17.04 | 1.83 | 0.74 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | base-q5_1 | 1 | 1 | 17.17 | 1.83 | 0.74 | 0.02 | 22c96b4 |
M2 ULTRA | METAL | small | 1 | 1 | 42.33 | 3.64 | 1.60 | 0.05 | 22c96b4 |
M2 ULTRA | METAL | small-q5_0 | 1 | 1 | 47.61 | 3.63 | 1.70 | 0.05 | 22c96b4 |
M2 ULTRA | METAL | small-q5_1 | 1 | 1 | 47.70 | 3.66 | 1.68 | 0.05 | 22c96b4 |
M2 ULTRA | METAL | medium | 1 | 1 | 114.42 | 7.53 | 3.55 | 0.11 | 22c96b4 |
M2 ULTRA | METAL | medium-q5_0 | 1 | 1 | 132.63 | 7.02 | 3.77 | 0.13 | 22c96b4 |
M2 ULTRA | METAL | medium-q5_1 | 1 | 1 | 132.28 | 7.10 | 3.76 | 0.13 | 22c96b4 |
M2 ULTRA | METAL | medium-dis | 1 | 1 | 102.34 | 1.01 | 0.42 | 0.01 | 22c96b4 |
M2 ULTRA | METAL | large-v2 | 1 | 1 | 203.01 | 11.03 | 5.45 | 0.20 | 22c96b4 |
M2 ULTRA | METAL | large-v2-q5_0 | 1 | 1 | 240.05 | 10.18 | 5.98 | 0.23 | 22c96b4 |
M2 ULTRA | METAL | large-v2-q5_1 | 1 | 1 | 239.22 | 10.23 | 5.87 | 0.23 | 22c96b4 |
M2 ULTRA | METAL | large-v2-dis | 1 | 1 | 181.14 | 1.14 | 0.48 | 0.02 | 22c96b4 |
Ryzen 9 5950X + RTX 2060
CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
Ryzen 9 5950X | AVX2 | tiny | 8 | 0 | 195.29 | 1.57 | 0.51 | 0.26 | 22c96b4 |
Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 0 | 213.33 | 1.10 | 0.50 | 0.30 | 22c96b4 |
Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 0 | 219.38 | 1.18 | 0.53 | 0.32 | 22c96b4 |
Ryzen 9 5950X | AVX2 | base | 8 | 0 | 424.85 | 3.71 | 1.03 | 0.46 | 22c96b4 |
Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 0 | 473.61 | 1.81 | 0.82 | 0.52 | 22c96b4 |
Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 0 | 484.14 | 1.92 | 0.85 | 0.56 | 22c96b4 |
Ryzen 9 5950X | AVX2 | small | 8 | 0 | 1458.32 | 12.66 | 3.09 | 1.26 | 22c96b4 |
Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 0 | 1673.22 | 6.42 | 2.18 | 1.45 | 22c96b4 |
Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 0 | 1724.78 | 6.72 | 2.32 | 1.52 | 22c96b4 |
Ryzen 9 5950X | AVX2 | medium | 8 | 0 | 4333.87 | 36.80 | 8.56 | 3.37 | 22c96b4 |
Ryzen 9 5950X | AVX2 | medium-q5_0 | 8 | 0 | 5194.09 | 19.21 | 5.71 | 3.97 | 22c96b4 |
Ryzen 9 5950X | AVX2 | medium-q5_1 | 8 | 0 | 5450.39 | 20.01 | 5.99 | 4.17 | 22c96b4 |
Ryzen 9 5950X | AVX2 | medium-dis | 8 | 0 | 3995.19 | 5.08 | 1.21 | 0.55 | 22c96b4 |
Ryzen 9 5950X | AVX2 | large-v2 | 8 | 0 | 8056.16 | 69.74 | 16.11 | 6.13 | 22c96b4 |
Ryzen 9 5950X | AVX2 | large-v2-q5_0 | 8 | 0 | 9799.58 | 35.16 | 10.49 | 7.28 | 22c96b4 |
Ryzen 9 5950X | AVX2 | large-v2-q5_1 | 8 | 0 | ms | 36.74 | 11.02 | 7.65 | 22c96b4 |
Ryzen 9 5950X | AVX2 | large-v2-dis | 8 | 0 | 7490.03 | 7.40 | 1.70 | 0.72 | 22c96b4 |
GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
RTX 2060 | AVX2 CUDA | tiny | 8 | 0 | 12.54 | 0.93 | 0.29 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | tiny-q5_0 | 8 | 0 | 12.73 | 0.98 | 0.24 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | tiny-q5_1 | 8 | 0 | 12.72 | 0.99 | 0.24 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | base | 8 | 0 | 24.14 | 1.28 | 0.41 | 0.03 | 22c96b4 |
RTX 2060 | AVX2 CUDA | base-q5_0 | 8 | 0 | 24.58 | 1.38 | 0.35 | 0.03 | 22c96b4 |
RTX 2060 | AVX2 CUDA | base-q5_1 | 8 | 0 | 24.58 | 1.37 | 0.35 | 0.03 | 22c96b4 |
RTX 2060 | AVX2 CUDA | small | 8 | 0 | 74.70 | 2.91 | 0.84 | 0.07 | 22c96b4 |
RTX 2060 | AVX2 CUDA | small-q5_0 | 8 | 0 | 76.12 | 2.84 | 0.77 | 0.08 | 22c96b4 |
RTX 2060 | AVX2 CUDA | small-q5_1 | 8 | 0 | 76.14 | 2.84 | 0.76 | 0.08 | 22c96b4 |
RTX 2060 | AVX2 CUDA | medium | 8 | 0 | 200.69 | 6.46 | 1.83 | 0.17 | 22c96b4 |
RTX 2060 | AVX2 CUDA | medium-q5_0 | 8 | 0 | 204.80 | 5.90 | 1.65 | 0.19 | 22c96b4 |
RTX 2060 | AVX2 CUDA | medium-q5_1 | 8 | 0 | 205.61 | 5.85 | 1.61 | 0.19 | 22c96b4 |
RTX 2060 | AVX2 CUDA | medium-dis | 8 | 0 | 186.17 | 0.86 | 0.24 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | large-v2 | 8 | 0 | 347.22 | 10.36 | 2.82 | 0.29 | 22c96b4 |
RTX 2060 | AVX2 CUDA | large-v2-q5_0 | 8 | 0 | 357.06 | 8.81 | 2.58 | 0.34 | 22c96b4 |
RTX 2060 | AVX2 CUDA | large-v2-q5_1 | 8 | 0 | 356.97 | 8.62 | 2.49 | 0.33 | 22c96b4 |
RTX 2060 | AVX2 CUDA | large-v2-dis | 8 | 0 | 318.05 | 1.03 | 0.34 | 0.04 | 22c96b4 |
GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|---|
RTX 2060 | AVX2 CUDA | tiny | 8 | 1 | 7.21 | 0.76 | 0.29 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | tiny-q5_0 | 8 | 1 | 7.42 | 0.82 | 0.18 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | tiny-q5_1 | 8 | 1 | 7.38 | 0.82 | 0.18 | 0.02 | 22c96b4 |
RTX 2060 | AVX2 CUDA | ... |
v1.5.5
Overview
Many small incremental updates + token-level timestamps with DTW by @denersc in #1485
Feedback is welcome!
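A hedged example of trying the new token-level timestamps. This assumes main exposes a `--dtw` option taking a model preset and `-ojf` for full JSON output; flag names may differ, see #1485:

```
# request DTW-based token timestamps and include them in the full JSON output
./main -m models/ggml-base.en.bin -f samples/jfk.wav --dtw base.en -ojf
```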
Full Changelog: v1.5.4...v1.5.5
What's Changed
- server : fix server temperature + add temperature_inc by @ggerganov in #1729
- main : add cli option to disable system prints by @ggerganov in #1740
- server: add request path by @eschmidbauer in #1741
- Optional Piper TTS support for talk-llama example. by @RhinoDevel in #1749
- fix/1748 by @nank1ro in #1750
- Don't compute timestamps when not printing them. by @ghindle in #1755
- Add more parameters to server api by @ghindle in #1754
- Add SetInitialPrompt method to go bindings by @blib in #1753
- ggml : fix 32-bit ARM compat for IQ2_XS by @ggerganov in #1758
- refactor: get all scripts to be POSIX Compliant by @sonphantrung in #1725
- whisper : load the model into multiple buffers of max size 1GB by @ggerganov in #1763
- rebase against your -np changes (thx) and add better python file to be used on the command line or as library by @contractorwolf in #1744
- examples/talk-llama: Add optional commandline parameter to set the bot name. by @RhinoDevel in #1764
- server : fix building and simplify lib deps on Windows by @przemoc in #1772
- talk-llama: optional wake-up command and audio confirmation by @Rakksor in #1765
- examples/server: implement "verbose_json" format with token details by @rmmh in #1781
- whisper.android: Return output from benchmarks by @luciferous in #1785
- libwhisper.so should be position independent by @trixirt in #1792
- Docs: try to make model options / model install methods clearer by @mrienstra in #1806
- common : fix input buffer check by @ggerganov in #1812
- Update Makefile by @jwijffels in #1813
- Add fields to `verbose_json` response and show examples on the home page by @JacobLinCool in #1802
- common: fix wav buffer detection by @JacobLinCool in #1819
- Add macOS deployment target option to Makefile by @didzis in #1839
- Expose CUDA device setting in public API by @didzis in #1840
- whisper.android: How to build with CLBlast by @luciferous in #1809
- server: Allow CORS request with authorization headers by @valenting in #1850
- Embed Metal library source into compiled binary by @didzis in #1842
- added audio_ctx argument to main and server examples by @dscripka in #1857
- whisper : fix external encoder by @ggerganov in #1860
- swift : package no longer use ggml dependency by @ggerganov in #1861
- fix openvino setup docs by @jumpers775 in #1874
- clean up common code in examples by @felrock in #1871
- main : check if input files exist before proceeding by @Theldus in #1872
- Linking issue fix via Makefile when CUBLAS enabled in the WSL #1876 by @lbluep in #1878
- main : fix file existence check in main.cpp by @Theldus in #1889
- openvino : fix convert-whisper-to-openvino.py for v2023.0.0 (#1870) by @st-gr in #1890
- ggml : 32-bit arm compat by @ggerganov in #1891
- Add SYCL logic in whisper by @abhilash1910 in #1863
- talk and talk-llama: Pass text_to_speak as a file by @tamo in #1865
- Stream.wasm: Fix invalid memory access when no segments are returned by @Andrews54757 in #1902
- Update README to Recommend MacOS Sonoma for Core ML to avoid hallucination by @gavin1818 in #1917
- Add library versioning by @kenneth-ge in #1352
- Fix SF(segment fault) issue in Android JNI by @zhouwg in #1929
- Fix typo in source file whisper.cpp by @zhouwg in #1925
- bench:fix typo by @zhouwg in #1933
- Auto lowercase language parameter by @F1L1Pv2 in #1928
- ggml : try fix 32-bit arm compat by @ggerganov in #1938
- whisper : make beam candidate sort more stable by @josharian in #1943
- bindings/go : add linker flags to make metal work by @josharian in #1944
- whisper : improve beam search candidate diversity by @josharian in #1947
- whisper : document whisper_batch.n_seq_id by @josharian in #1942
- Rename --audio-context to --audio-ctx, as per help text by @joliss in #1953
- [DRAFT] Token level timestamps with DTW (#375) by @denersc in #1485
- Fedora dependencies needed (SDL2) by @Man2Dev in #1970
- libcuda.so.1 in PATH in Docker Container by @tiagofassoni in #1966
- ruby : fix build by @ggerganov in #1980
- Improve support for distil-large-v3 by @sanchit-gandhi in #1982
- whisper : improve handling of prompts by @ggerganov in #1981
- sync : ggml by @ggerganov in #2001
- Implemented command-style grammar in the main example. by @ulatekh in #1998
- Use pkg-config for OpenBLAS by @przemoc in #1778
- ci : add building in MSYS2 environments (Windows) by @przemoc in #1994
- Support CUDA versions < 11.1 by @primenko-v in #2020
- Create solution folders in the CMake build by @ulatekh in #2004
- Allow a regular expression to describe tokens to suppress by @ulatekh in #1997
- "main" example now allows a response-file as the sole parameter by @ulatekh in #2019
- Support for CPU BLAS build via Intel MKL by @slashlib in #2024
- Set stdin to binary mode on Windows. Fixes #2023 by @rotemdan in #2025
- Fix file-handle leak in read_wav() by @ulatekh in #2026
- Fix DTW memory access by @bradmurray-dt in #2012
- whisper: update grammar-parser.cpp by @eltociear in #2058
- fix missing reference to "model" variable in actual shell command run in whisper.nvim by @sixcircuit in #2049
- build : detect AVX512 in Makefile, add AVX512 option in CMake by @didzis in #2043
- feature/no timestamps node by @pprobst in #2048
- Update embedded Metal library generation process to include dependency by @didzis in #2045
- server.cpp: add dtw by @eschmidbauer in #2044
New Contributors
- @eschmidbauer made their first contribution in #1741
- @RhinoDevel made their first contribution in #1749
- @nank1ro made their first contribution in #1750
- @ghindle made their first contribution in #1755
- @blib made their first contribution in #1753
- @sonphantrung made their first contribution in #1725
- @contractorwolf made their first contribution in #1744
- @Rakksor made their first contribution in #1765
- @rmmh made their f...
v1.5.4
v1.5.3
Overview
Minor maintenance release:
- Fix CUDA issues where the transcription produces garbage
- Fix quantized models to work with the CUDA backend
- Allow using `whisper.cpp` and `llama.cpp` together in SwiftUI projects
What's Changed
- Update bench.py by @ForkedInTime in #1655
- cmake : Resolve quantized model issue when CUBLAS enabled by @bobqianic in #1667
- examples : Revert CMakeLists.txt for talk-llama by @bobqianic in #1669
- CI : Add coverage for talk-llama when WHISPER_CUBLAS=1 by @bobqianic in #1672
- ci: build and push docker image by @OpenWaygate in #1674
- sync : ggml (ggml_scale, ggml_row_size, etc.) by @ggerganov in #1677
- Replace `WHISPER_PRINT_DEBUG` with `WHISPER_LOG_DEBUG` by @bobqianic in #1681
- download: Fix large q5 model name by @dimopep in #1695
- sync : ggml (VMM, sync-ggml-am.sh, dotprod ARM fixes) by @ggerganov in #1691
- whisper : replace `tensor->n_dims` with `ggml_n_dims(tensor)` by @bobqianic in #1694
- Build with CLBlast by @tamo in #1576
- docker : Fix the Publishing of the CUDA Docker Image by @bobqianic in #1704
- emscripten: fix "Stack Overflow!" by @Huguet57 in #1713
- sync : ggml by @ggerganov in #1717
- Add error handling to graph_compute by @finnvoor in #1714
- Updates Package.swift to use ggml as package dependency by @1-ashraful-islam in #1701
New Contributors
- @ForkedInTime made their first contribution in #1655
- @OpenWaygate made their first contribution in #1674
- @dimopep made their first contribution in #1695
- @Huguet57 made their first contribution in #1713
- @1-ashraful-islam made their first contribution in #1701
Full Changelog: v1.5.2...v1.5.3
v1.5.2
Overview
Minor maintenance release:
- Re-enable CPU BLAS processing after fixing a regression (#1583)
- Add new example: wchess
wchess-0.mp4
Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!
What's Changed
- automatically convert audio on the server by @sapoepsilon in #1539
- CI : Rectify the Clang-Related workflow issues by @bobqianic in #1551
- CI : Add CUDA 11.8.0 support by @bobqianic in #1554
- Update main program help info by @bebound in #1560
- Set default CORS headers to allow all by @kasumi-1 in #1567
- cmake : install required ggml.h header by @gjasny in #1568
- Backport .srt output format to examples/server by @osdrv in #1565
- Added support for .vtt format to Whisper server by @aleksanderandrzejewski in #1578
- ggml : re-enable blas for src0 != F32 by @ggerganov in #1583
- Fix 32-bit compiler warning by @Digipom in #1575
- Remove #if arch(arm) check in Swift Package Manager by @finnvoor in #1561
- Pass max-len argument to server wparams by @osdrv in #1574
- sync : ggml (new ops, new backend, etc) by @ggerganov in #1602
- Fix `ggml_metal_log` on Intel macs by @finnvoor in #1606
- Update CMakeLists.txt by @Kreijstal in #1615
- target windows 8 or above for prefetchVirtualMemory in llama-talk by @Kreijstal in #1617
- sync : ggml (Metal fixes, new ops, tests) by @ggerganov in #1633
- wchess: whisper assisted chess by @fraxy-v in #1595
New Contributors
- @sapoepsilon made their first contribution in #1539
- @bebound made their first contribution in #1560
- @kasumi-1 made their first contribution in #1567
- @gjasny made their first contribution in #1568
- @osdrv made their first contribution in #1565
- @aleksanderandrzejewski made their first contribution in #1578
- @Kreijstal made their first contribution in #1615
- @fraxy-v made their first contribution in #1595
Full Changelog: v1.5.1...v1.5.2
v1.5.1
Overview
Minor update:
- With Metal, automatically fall back to CPU if the device does not support the Apple7 family
- Add server example (usage sketch below)
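A quick sketch of using the new server example. The endpoint name and form fields here are assumptions based on the example's README; defaults may differ:

```
# start the HTTP transcription server
./server -m models/ggml-base.en.bin --host 127.0.0.1 --port 8080

# send an audio file for transcription
curl 127.0.0.1:8080/inference \
  -H "Content-Type: multipart/form-data" \
  -F file=@samples/jfk.wav \
  -F response_format=json
```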
What's Changed
- ISSUE-1329: replace " with ' so it doesn't try to execute code in backticks by @spullara in #1364
- sync : ggml (ggml-alloc + linker + gguf fixes) by @ggerganov in #1501
- Fixed with_state methods, to use the correct state by @sandrohanea in #1519
- #1517 Redistribute CUDA DLLs by @tamo in #1522
- whisper : reuse whisper_decode_with_state by @ggerganov in #1521
- sdl : fix audio callback by @ggerganov in #1523
- update deprecated example by @MightyStud in #1529
- Super Simple Whisper Server by @felrock in #1380
- Close file after writing in server application by @felrock in #1533
- bench : multi-thread memcpy by @ggerganov in #1534
- Change temp file name for server application by @felrock in #1535
- Fixed Makefile for MacOS ARM 64 Go bindings by @gleicon in #1530
- Fixed metal build on macos-latest by @sandrohanea in #1544
- fix(server): typo in temperature parameter by @Okabintaro in #1545
- Request to add a new function to get the full language name by @bradmit in #1546
- server : add --print-realtime param by @ecneladis in #1541
- cuda : sync some minor stuff from llama.cpp by @ggerganov in #1548
- metal : add backend function to check device family support by @ggerganov in #1547
New Contributors
- @spullara made their first contribution in #1364
- @MightyStud made their first contribution in #1529
- @felrock made their first contribution in #1380
- @gleicon made their first contribution in #1530
- @Okabintaro made their first contribution in #1545
- @bradmit made their first contribution in #1546
- @ecneladis made their first contribution in #1541
Full Changelog: v1.5.0...v1.5.1
v1.5.0
Overview
This major release includes the following changes:
- Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
- Efficient beam-search implementation via batched decoding and unified KV cache
- Full quantization support of all available `ggml` quantization types
- Support for grammar-constrained sampling
- Support for Distil Whisper models
- Support for Whisper Large-v3
and more
Full GPU support
On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to a lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:
Encoder performance on Apple M1 Max - before and after (plot by @dreness)
For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.
The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with `WHISPER_CUBLAS=1`:

```
# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make
```
Implementation: #1472
Special credits to: @FSSRepo, @slaren
Batched decoding + efficient Beam Search
At last, `whisper.cpp` now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, 5 beams is slightly slower than 1 beam, but still significantly faster than the 5x single-batch time observed with the old naive implementation.
Beam Search is now enabled by default in `whisper.cpp` to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
Implementation: #1486
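To compare beam sizes yourself, something like this should work, assuming main's `-bs`/`--beam-size` parameter (beam size 5 is now the default):

```
# greedy decoding (1 beam) vs. beam search with 5 beams
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 1
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5
```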
Quantization support
All `ggml` quantization types are now supported. Quantization mixtures for the Whisper model can be implemented. It is still unclear how quantization affects the quality - this is an interesting area that can be explored in the future.
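For example, producing and using a `Q5_0` model with the bundled quantize tool (model paths are placeholders):

```
# quantize a ggml model to Q5_0, then transcribe with it
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
./main -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav
```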
Grammar sampling
The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.
whisper-chess.mp4
Implementation: #1229
Special credits to @ejones
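A sketch of constraining the decoder with a GBNF grammar, assuming the `command` example accepts a `--grammar` parameter as added in #1229 (the grammar file and its rules are made up for illustration):

```
# write a tiny GBNF grammar that only allows simple chess-move phrases
cat > chess.gbnf <<'EOF'
root   ::= piece " to " square
piece  ::= "pawn" | "knight" | "bishop" | "rook" | "queen" | "king"
square ::= [a-h] [1-8]
EOF

# use it to constrain the decoder in the command example
./command -m models/ggml-base.en.bin --grammar ./chess.gbnf
```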
Distil Whisper
Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper
`whisper.cpp` offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for distilled models are included in the Benchmarks section below.
Implementation: #1424
Whisper Large-v3
Recently, OpenAI released version 3 of the Large model: openai/whisper#1761
Implementation: #1444
Benchmarks
Below is a breakdown of the performance of `whisper.cpp` on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The `Dec.` column corresponds to batch size 1, the `Bch5` column to batch size 5, and the `PP` column to batch size 128.
For optimal Beam Search performance, the `Bch5` number should be 5 times smaller than `Dec.`
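Numbers like these can presumably be reproduced with the bundled bench tool (model path is a placeholder):

```
# measure Encoder/Decoder speed for one model on the current machine
./bench -m models/ggml-base.en.bin -t 1
```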
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14... |
v1.4.3
This is a minor release, the main reason for which is that there hasn't been an official release for a few months now and some small things have accumulated on the `master` branch that would be nice to be upstreamed. I am planning a major `v1.5.0` release with some new and long-awaited functionality soon:
- Full CUDA offloading
- Efficient Beam-Search implementation
- Grammar support
The current version `v1.4.3` should be considered in beta as I haven't worked intensively on `whisper.cpp` recently and there might be some issues that made their way into the code. I'll try to polish things in the next days and prepare a stable `v1.5.0` release. In the meantime, any feedback will be highly appreciated.
Detailed API changes, features and new contributor recognitions will be included in the `v1.5.0` release.
v1.4.0
Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support
Integer quantization
This allows the `ggml` Whisper models to be converted from the default 16-bit floating point weights to 4, 5 or 8 bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.
- Supported quantization modes: `Q4_0`, `Q4_1`, `Q4_2`, `Q5_0`, `Q5_1`, `Q8_0`
- Implementation details: #540
- Usage instructions: README
- All WASM examples now support `Q5` quantized models: https://whisper.ggerganov.com
Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results give an impression of the expected quality, size and speed of quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
RWKV quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
`Q4_0` | 17.507 | 76 | 1.53 |
`Q4_1` | 17.187 | 72 | 1.68 |
`Q4_2` | 17.060 | 85 | 1.53 |
`Q5_0` | 16.194 | 78 | 1.60 |
`Q5_1` | 15.851 | 81 | 1.68 |
`Q8_0` | 15.652 | 89 | 2.13 |
`FP16` | 15.623 | 117 | 2.82 |
`FP32` | 15.623 | 198 | 5.64 |
ref: ggml-org/ggml#89 (comment)
This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2
GPU support via cuBLAS
Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
- Updated escape_double_quotes() Function by @tauseefmohammed2 in #776
- examples : add missing #include by @pH5 in #798
- Flush upon finishing inference by @tarasglek in #811
- Escape quotes in csv output by @laytan in #815
- C++11style by @wuyudi in #768
- Optionally allow a Core ML build of Whisper to work with or without Core ML models by @Canis-UK in #812
- add some tips about in the readme of the android project folder by @Zolliner in #816
- whisper: Use correct seek_end when offset is used by @ThijsRay in #833
- ggml : fix 32-bit ARM NEON by @ggerganov in #836
- Add CUDA support via cuBLAS by @ggerganov in #834
- Integer quantisation support by @ggerganov in #540
New Contributors
- @tauseefmohammed2 made their first contribution in #776
- @pH5 made their first contribution in #798
- @tarasglek made their first contribution in #811
- @laytan made their first contribution in #815
- @wuyudi made their first contribution in #768
- @Canis-UK made their first contribution in #812
- @Zolliner made their first contribution in #816
- @ThijsRay made their first contribution in #833
Full Changelog: v1.3.0...v1.4.0