feat: improve asr with mp3 and gemini #60

JacobLinCool · 2024-12-23T18:59:11Z

resolve #54

Audio Transcription Latency Improvement Report

Overview

In educational environments, network stability is often inconsistent, leading to delays in audio transcription processes. While our current goal may not focus on real-time transcription, achieving a round-trip latency of less than 5 seconds (from audio submission to receiving transcription results) is critical for improving the user experience.

Current Test Results

Audio Formats and Sizes:

WAV: 1,920,126 bytes
MP3: 160,557 bytes
Duration: 20,000 ms
Conversion Time: 325 ms
WASM FFmpeg PCM-to-MP3 Transcoder: 1.2 MB (our fork)
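The size figures above can be turned into a ratio, an implied bitrate, and a rough upload-time estimate. The following is a minimal sketch; the function names are illustrative, and the 1.5 Mbit/s uplink is an assumed value chosen for the example, not a figure measured in this report.

```typescript
// Figures copied from the test results above.
const WAV_BYTES = 1_920_126;
const MP3_BYTES = 160_557;
const DURATION_MS = 20_000;

// How many times larger `a` is than `b`.
function sizeRatio(a: number, b: number): number {
  return a / b;
}

// Average bitrate in kbit/s (bits per millisecond equals kbit/s).
function bitrateKbps(bytes: number, durationMs: number): number {
  return (bytes * 8) / durationMs;
}

// Idealized upload time in ms for a given uplink, ignoring protocol
// overhead and round-trip latency. 1 Mbit/s is 1000 bits per millisecond.
function transferMs(bytes: number, uplinkMbps: number): number {
  return (bytes * 8) / (uplinkMbps * 1000);
}

// ASSUMPTION: 1.5 Mbit/s uplink, used only for illustration.
const ASSUMED_UPLINK_MBPS = 1.5;

console.log("WAV/MP3 size ratio:", sizeRatio(WAV_BYTES, MP3_BYTES).toFixed(2));
console.log("MP3 kbit/s:", bitrateKbps(MP3_BYTES, DURATION_MS).toFixed(1));
console.log("WAV kbit/s:", bitrateKbps(WAV_BYTES, DURATION_MS).toFixed(1));
console.log("WAV upload ms:", Math.round(transferMs(WAV_BYTES, ASSUMED_UPLINK_MBPS)));
console.log("MP3 upload ms:", Math.round(transferMs(MP3_BYTES, ASSUMED_UPLINK_MBPS)));
```

Under these assumptions the MP3 works out to about 64 kbit/s and the WAV to about 768 kbit/s, and the idealized upload takes about 10.2 s for WAV versus about 0.9 s for MP3.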

Latency Results Summary:

Service	Format	Cold Start (ms)	Warm Start (ms)	Unstable Network (Fast 4G, ms)	Size (bytes)
HuggingFace Whisper	WAV	23,777	12,616	40,304	1,920,126
HuggingFace Whisper	MP3	24,546	13,191	15,214	160,557
Gemini Flash	WAV	N/A	3,264	14,290	1,920,126
Gemini Flash	MP3	N/A	3,590	4,183	160,557
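To make the comparison in the table easier to check, here is a small sketch that recomputes speedup ratios and filters rows against the 5-second target. The data is copied from the table; the type and function names (`LatencyRow`, `speedup`, `withinBudget`, `pick`) are illustrative and not part of this PR.

```typescript
// Round-trip latencies copied from the results table (milliseconds).
interface LatencyRow {
  service: string;
  format: "WAV" | "MP3";
  coldMs: number | null; // null where the table says N/A
  warmMs: number;
  fast4gMs: number;
}

const results: LatencyRow[] = [
  { service: "HuggingFace Whisper", format: "WAV", coldMs: 23777, warmMs: 12616, fast4gMs: 40304 },
  { service: "HuggingFace Whisper", format: "MP3", coldMs: 24546, warmMs: 13191, fast4gMs: 15214 },
  { service: "Gemini Flash", format: "WAV", coldMs: null, warmMs: 3264, fast4gMs: 14290 },
  { service: "Gemini Flash", format: "MP3", coldMs: null, warmMs: 3590, fast4gMs: 4183 },
];

// Look up one row; throws if the combination is not in the table.
function pick(service: string, format: "WAV" | "MP3"): LatencyRow {
  const row = results.find((r) => r.service === service && r.format === format);
  if (!row) throw new Error(`no row for ${service} ${format}`);
  return row;
}

// How many times faster the candidate is than the baseline.
function speedup(baselineMs: number, candidateMs: number): number {
  return baselineMs / candidateMs;
}

// Rows whose simulated Fast 4G round trip fits the latency budget.
function withinBudget(rows: LatencyRow[], budgetMs: number): LatencyRow[] {
  return rows.filter((row) => row.fast4gMs < budgetMs);
}

for (const format of ["WAV", "MP3"] as const) {
  const whisper = pick("HuggingFace Whisper", format);
  const gemini = pick("Gemini Flash", format);
  console.log(format, "warm speedup:", speedup(whisper.warmMs, gemini.warmMs).toFixed(2));
  console.log(format, "Fast 4G speedup:", speedup(whisper.fast4gMs, gemini.fast4gMs).toFixed(2));
}
console.log("under 5 s on Fast 4G:", withinBudget(results, 5000));
```

With the table's numbers, the warm-start ratios come out at roughly 3.9x (WAV) and 3.7x (MP3), and only the Gemini Flash MP3 row fits a 5,000 ms budget on Fast 4G.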

Key Observations

MP3 Efficiency: The MP3 file is about 12x smaller than the WAV (160,557 vs. 1,920,126 bytes), making it far more efficient to transmit over unstable networks.
Local Environment Limitations: In a controlled local environment (frontend and backend running on the same machine), the smaller MP3 payload did not reduce latency; the MP3 runs were about 0.3 to 0.8 s slower than the WAV runs.
Network Impact: Under a network-constrained environment (Fast 4G simulation), MP3 significantly outperformed WAV: round-trip latency with Gemini Flash fell from 14,290 ms to 4,183 ms, and with HuggingFace Whisper from 40,304 ms to 15,214 ms.
WASM FFmpeg Transcoder: The lightweight (1.2 MB) WASM FFmpeg PCM-to-MP3 transcoder encodes in the browser with little overhead, converting the 20-second test clip in 325 ms.
Service Performance: Gemini Flash consistently outperformed Whisper in both WAV and MP3 scenarios.
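Acceleration Gains: On warm starts, Gemini Flash was roughly 3.7x to 3.9x faster than HuggingFace Whisper (about 2.8x to 3.6x on simulated Fast 4G). A 6-7x ratio appears only when Whisper's cold-start latency is compared with Gemini Flash, for which no cold start was measured.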

Key Improvement Areas

1. Adopt MP3 over WAV

MP3 minimizes transmission latency in network-constrained environments.
In-browser MP3 transcoding using WASM FFmpeg (1.2 MB) is lightweight and efficient.
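As a rough illustration of how a client might prefer MP3 for upload while keeping WAV as a fallback, here is a minimal sketch. The `Mp3Encoder` type is hypothetical and stands in for whatever the WASM FFmpeg fork actually exposes; a real WASM encoder would most likely be asynchronous, and the WAV fallback is an assumption of this sketch rather than behavior confirmed by this PR.

```typescript
// HYPOTHETICAL: stand-in for the WASM FFmpeg PCM-to-MP3 transcoder.
type Mp3Encoder = (pcm: Int16Array, sampleRate: number) => Uint8Array;

interface UploadPayload {
  mime: "audio/mpeg" | "audio/wav";
  bytes: Uint8Array;
}

// Wrap mono 16-bit PCM in a minimal 44-byte RIFF/WAVE header.
function pcmToWav(pcm: Int16Array, sampleRate: number): Uint8Array {
  const dataLen = pcm.length * 2;
  const view = new DataView(new ArrayBuffer(44 + dataLen));
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + dataLen, true); // remaining file size
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // format 1 = PCM
  view.setUint16(22, 1, true); // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // bytes per second
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  ascii(36, "data");
  view.setUint32(40, dataLen, true);
  pcm.forEach((sample, i) => view.setInt16(44 + i * 2, sample, true));
  return new Uint8Array(view.buffer);
}

// Prefer MP3 when an encoder is available and succeeds; otherwise send WAV.
function prepareUpload(pcm: Int16Array, sampleRate: number, encode?: Mp3Encoder): UploadPayload {
  if (encode) {
    try {
      const mp3 = encode(pcm, sampleRate);
      if (mp3.byteLength > 0) return { mime: "audio/mpeg", bytes: mp3 };
    } catch {
      // Encoding failed; fall through to the WAV path.
    }
  }
  return { mime: "audio/wav", bytes: pcmToWav(pcm, sampleRate) };
}
```

Keeping the fallback means a failed or missing encoder costs bandwidth rather than dropping the recording, which seems a reasonable trade-off when the network is the unreliable part.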

2. Switch to a Faster Transcription Service

HuggingFace Whisper suffers from high latency in both cold-start (about 24 s) and warm-start (about 13 s) scenarios.
Gemini Flash offers significantly lower latency (about 3.3 to 3.6 s warm) and is better suited for real-time or near-real-time applications.

Conclusion

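A round-trip latency of under 5 seconds is attainable by changing both the audio format and the transcription service: in these tests, Gemini Flash with MP3 was the only combination that stayed under 5 seconds on the simulated Fast 4G network (4,183 ms). These targeted changes should significantly improve the user experience, particularly in network-limited educational environments.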

Copilot reviewed 6 out of 9 changed files in this pull request and generated no comments.

Files not reviewed (3)

package.json: Language not supported
src/lib/components/session/ParticipantView.svelte: Language not supported
src/routes/test/stt/+page.svelte: Language not supported

Comments suppressed due to low confidence (1)

src/lib/stt/gemini.ts:24

[nitpick] The error message 'Failed to transcribe audio' could be more descriptive. Consider including more details to aid in debugging.

throw new Error('Failed to transcribe audio');
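One way to act on this nitpick is to attach the service name, HTTP status, and underlying cause to the error. The sketch below is hypothetical: `TranscriptionError` and `describeFailure` are illustrative names, and the surrounding code in src/lib/stt/gemini.ts is not shown here, so how it would be wired in is an assumption.

```typescript
// HYPOTHETICAL error type carrying context useful for debugging.
class TranscriptionError extends Error {
  service: string;
  status: number | undefined;
  originalCause: unknown;

  constructor(message: string, service: string, status: number | undefined, originalCause: unknown) {
    super(message);
    this.name = "TranscriptionError";
    this.service = service;
    this.status = status;
    this.originalCause = originalCause;
  }
}

// Build a descriptive error from whatever the failing call produced.
function describeFailure(service: string, status: number | undefined, cause: unknown): TranscriptionError {
  const reason = cause instanceof Error ? cause.message : String(cause);
  const statusPart = status === undefined ? "" : ` (HTTP ${status})`;
  return new TranscriptionError(
    `Failed to transcribe audio via ${service}${statusPart}: ${reason}`,
    service,
    status,
    cause,
  );
}

// Example: a rate-limit failure reported by the upstream service.
const example = describeFailure("gemini", 429, new Error("quota exceeded"));
console.log(example.message);
```

A caller could then replace the bare `throw new Error(...)` with `throw describeFailure(...)`, so logs show which service failed and why.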

feat: improve asr with mp3 and gemini

950c241

JacobLinCool self-assigned this Dec 23, 2024

Copilot bot review requested due to automatic review settings December 23, 2024 18:59

Copilot AI reviewed Dec 23, 2024


chore: add GOOGLE_GENAI_API_KEY to env example

e5f40b8

JacobLinCool merged commit 69b2d77 into main Dec 23, 2024
4 checks passed

JacobLinCool deleted the asr-enhancement branch December 23, 2024 19:03
