feat: improve asr with mp3 and gemini #60

Merged: 2 commits merged into main from asr-enhancement on Dec 23, 2024

@JacobLinCool (Member) commented Dec 23, 2024

resolve #54

Audio Transcription Latency Improvement Report


Overview

In educational environments, network stability is often inconsistent, leading to delays in audio transcription processes. While our current goal may not focus on real-time transcription, achieving a round-trip latency of less than 5 seconds (from audio submission to receiving transcription results) is critical for improving the user experience.


Current Test Results

Audio Formats and Sizes:

  • WAV: 1,920,126 bytes
  • MP3: 160,557 bytes
  • Duration: 20,000 ms
  • Conversion Time: 325 ms
  • WASM FFmpeg PCM-to-MP3 Transcoder (our fork): 1.2 MB

Latency Results Summary:

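| Service             | Format | Cold Start (ms) | Warm Start (ms) | Unstable Network (Fast 4G, ms) | Size (bytes) |
|---------------------|--------|-----------------|-----------------|--------------------------------|--------------|
| HuggingFace Whisper | WAV    | 23,777          | 12,616          | 40,304                         | 1,920,126    |
| HuggingFace Whisper | MP3    | 24,546          | 13,191          | 15,214                         | 160,557      |
| Gemini Flash        | WAV    | N/A             | 3,264           | 14,290                         | 1,920,126    |
| Gemini Flash        | MP3    | N/A             | 3,590           | 4,183                          | 160,557      |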
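
The report does not say how these figures were collected. As a minimal sketch only, one way to time a round trip (audio submission to transcription result) against the 5-second target is shown below. The names `timeTranscription`, `Transcriber`, and `fakeTranscribe` are hypothetical and not taken from this PR; `fakeTranscribe` is a stand-in for a real service call.

```typescript
// Hypothetical signature for any transcription backend (Whisper, Gemini, ...).
type Transcriber = (audio: Uint8Array) => Promise<string>;

interface TimedResult {
  text: string;
  elapsedMs: number;
  withinBudget: boolean;
}

// Measures wall-clock time from submission to result and compares it
// with a latency budget (5,000 ms by default, the target in this report).
async function timeTranscription(
  transcribe: Transcriber,
  audio: Uint8Array,
  budgetMs: number = 5_000
): Promise<TimedResult> {
  const start = performance.now();
  const text = await transcribe(audio);
  const elapsedMs = performance.now() - start;
  return { text, elapsedMs, withinBudget: elapsedMs < budgetMs };
}

// Stand-in transcriber (an assumption for illustration): waits briefly,
// then returns a dummy string instead of calling a real service.
const fakeTranscribe: Transcriber = async (audio) => {
  await new Promise<void>((resolve) => setTimeout(resolve, 20));
  return `transcribed ${audio.byteLength} bytes`;
};

timeTranscription(fakeTranscribe, new Uint8Array(4)).then((result) => {
  console.log(`${result.elapsedMs.toFixed(0)} ms, within budget: ${result.withinBudget}`);
});
```

Passing the transcriber as a parameter keeps the timing logic identical across services, so cold-start, warm-start, and throttled-network runs can be compared like for like.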

Key Observations

  1. MP3 Efficiency: The MP3 file is about 12x smaller than the WAV file (160,557 vs. 1,920,126 bytes), which makes it far cheaper to transmit over unstable networks.
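  2. Local Environment Limitations: In a controlled local environment (frontend and backend running on the same machine), the smaller MP3 size did not reduce latency. Warm-start times in the table were slightly higher for MP3 than for WAV (13,191 vs. 12,616 ms on HuggingFace Whisper, 3,590 vs. 3,264 ms on Gemini Flash).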
  3. Network Impact: Under network-constrained conditions (Fast 4G simulation), MP3 significantly outperformed WAV. Round-trip latency dropped from 14,290 ms to 4,183 ms with Gemini Flash, and from 40,304 ms to 15,214 ms with HuggingFace Whisper.
  4. WASM FFmpeg Transcoder: The lightweight (1.2 MB) WASM FFmpeg PCM-to-MP3 transcoder encodes in the browser with little overhead: converting the 20-second test clip took 325 ms.
  5. Service Performance: Gemini Flash consistently outperformed Whisper in both WAV and MP3 scenarios.
  6. Acceleration Gains: Gemini Flash's warm-start latency was 6-7x lower than HuggingFace Whisper's cold-start latency, and roughly 3.7-3.9x lower than Whisper's warm-start latency.
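
The ratios behind these observations can be recomputed directly from the table. The sketch below only does arithmetic on the reported figures; the variable and function names are illustrative and not part of the PR.

```typescript
// Figures copied from the latency table in this report.
const wavBytes = 1_920_126;
const mp3Bytes = 160_557;

interface Latency {
  wav: number; // milliseconds
  mp3: number; // milliseconds
}

const whisperCold: Latency = { wav: 23_777, mp3: 24_546 };
const whisperWarm: Latency = { wav: 12_616, mp3: 13_191 };
const whisperFast4g: Latency = { wav: 40_304, mp3: 15_214 };
const geminiWarm: Latency = { wav: 3_264, mp3: 3_590 };
const geminiFast4g: Latency = { wav: 14_290, mp3: 4_183 };

// How many times larger (or slower) `a` is than `b`.
function ratio(a: number, b: number): number {
  return a / b;
}

const report: Record<string, number> = {
  "WAV size / MP3 size": ratio(wavBytes, mp3Bytes),
  "Whisper warm / Gemini warm (WAV)": ratio(whisperWarm.wav, geminiWarm.wav),
  "Whisper warm / Gemini warm (MP3)": ratio(whisperWarm.mp3, geminiWarm.mp3),
  "Whisper cold / Gemini warm (WAV)": ratio(whisperCold.wav, geminiWarm.wav),
  "Whisper cold / Gemini warm (MP3)": ratio(whisperCold.mp3, geminiWarm.mp3),
  "Fast 4G WAV / MP3 (Whisper)": ratio(whisperFast4g.wav, whisperFast4g.mp3),
  "Fast 4G WAV / MP3 (Gemini)": ratio(geminiFast4g.wav, geminiFast4g.mp3),
};

for (const [label, value] of Object.entries(report)) {
  console.log(`${label}: ${value.toFixed(2)}x`);
}
```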

Key Improvement Areas

1. Adopt MP3 over WAV

  • MP3 minimizes transmission latency in network-constrained environments.
  • In-browser MP3 transcoding using WASM FFmpeg (1.2 MB) is lightweight and efficient.
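
To make the transmission argument concrete, here is a rough sketch that derives each file's effective bitrate from the reported sizes and the 20,000 ms duration, then estimates upload time on a slow link. The 1,000 kbps uplink is a hypothetical value chosen for illustration; it is not the Fast 4G preset used in the tests, and the estimate ignores round-trip latency and protocol overhead.

```typescript
// Effective bitrate in kbps: bits per millisecond equals kilobits per second.
function bitrateKbps(bytes: number, durationMs: number): number {
  return (bytes * 8) / durationMs;
}

// Idealized upload time in ms: bits divided by (kilobits per second).
function estimateUploadMs(bytes: number, uplinkKbps: number): number {
  return (bytes * 8) / uplinkKbps;
}

const durationMs = 20_000; // clip duration from the report
const conversionMs = 325;  // PCM-to-MP3 conversion time from the report
const uplinkKbps = 1_000;  // ASSUMPTION: hypothetical uplink speed

const wav = { bytes: 1_920_126 };
const mp3 = { bytes: 160_557 };

console.log(`WAV bitrate: ${bitrateKbps(wav.bytes, durationMs).toFixed(1)} kbps`);
console.log(`MP3 bitrate: ${bitrateKbps(mp3.bytes, durationMs).toFixed(1)} kbps`);

// MP3 pays the conversion cost up front but uploads far fewer bytes.
const wavTotalMs = estimateUploadMs(wav.bytes, uplinkKbps);
const mp3TotalMs = conversionMs + estimateUploadMs(mp3.bytes, uplinkKbps);
console.log(`WAV upload: ~${wavTotalMs.toFixed(0)} ms`);
console.log(`MP3 convert + upload: ~${mp3TotalMs.toFixed(0)} ms`);
```

On a fast local link the upload term shrinks toward zero and the fixed conversion cost dominates, which is consistent with MP3 showing no advantage in the local tests.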

2. Switch to a Faster Transcription Service

  • HuggingFace Whisper suffers from high latency in both cold-start (about 24 seconds) and warm-start (about 13 seconds) scenarios.
  • Gemini Flash offers significantly better latency performance and is better suited for real-time or near-real-time applications.

Conclusion

Achieving a round-trip latency of under 5 seconds is attainable through improvements in audio format and transcription service. These targeted changes will significantly enhance the user experience, particularly in network-limited educational environments.

@JacobLinCool JacobLinCool self-assigned this Dec 23, 2024
@Copilot Copilot bot review requested due to automatic review settings December 23, 2024 18:59

Copilot reviewed 6 out of 9 changed files in this pull request and generated no comments.

Files not reviewed (3)
  • package.json: Language not supported
  • src/lib/components/session/ParticipantView.svelte: Language not supported
  • src/routes/test/stt/+page.svelte: Language not supported
Comments suppressed due to low confidence (1)

src/lib/stt/gemini.ts:24

  • [nitpick] The error message 'Failed to transcribe audio' could be more descriptive. Consider including more details to aid in debugging.
throw new Error('Failed to transcribe audio');
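
One way to act on this nitpick is to fold request context and the underlying cause into the message. This is only a sketch: the actual contents of src/lib/stt/gemini.ts are not shown here, and `TranscriptionContext` and `describeTranscriptionFailure` are hypothetical names.

```typescript
// Hypothetical context describing the request that failed.
interface TranscriptionContext {
  service: string;
  format: string;
  bytes: number;
}

// Builds an Error whose message keeps the original wording but adds
// the request context and the underlying reason for easier debugging.
function describeTranscriptionFailure(
  context: TranscriptionContext,
  cause: unknown
): Error {
  const reason = cause instanceof Error ? cause.message : String(cause);
  return new Error(
    `Failed to transcribe audio (service=${context.service}, ` +
      `format=${context.format}, bytes=${context.bytes}): ${reason}`
  );
}

// Usage with a simulated upstream failure.
try {
  throw new Error("upstream returned 503");
} catch (err) {
  const detailed = describeTranscriptionFailure(
    { service: "gemini", format: "mp3", bytes: 160_557 },
    err
  );
  console.error(detailed.message);
}
```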
@JacobLinCool JacobLinCool merged commit 69b2d77 into main Dec 23, 2024
4 checks passed
@JacobLinCool JacobLinCool deleted the asr-enhancement branch December 23, 2024 19:03
Development

Successfully merging this pull request may close these issues.

Improve Audio Transcription Latency
1 participant