mesut92/nemo_asr_websocket


Real-Time Speech Recognition WebSocket with NeMo toolkit

This project is a real-time Automatic Speech Recognition (ASR) client using WebSockets to stream audio to a server for transcription. It captures audio from a microphone, sends it to a WebSocket server, and prints the received transcriptions.

Prerequisites

Before setting up the project, ensure you have the following installed:

  • Python 3.10
  • Conda (optional)
  • Docker (optional, recommended for environment management)
  • NeMo (NVIDIA's conversational AI toolkit)

Preview

Watch the video

Installation

Step 1: Set Up a Conda Environment

To avoid dependency conflicts, create a new conda environment:

conda create --name rt_asr python=3.10 -y
conda activate rt_asr

Step 2: Install NeMo

NeMo is required for speech recognition processing. You can install it using the following command:

pip install "nemo_toolkit[all]"

If you encounter issues, follow the official NeMo installation guide: NVIDIA NeMo GitHub.

Alternatively, install NeMo directly from the GitHub repository:

pip install git+https://github.com/NVIDIA/NeMo.git

Step 3: Install Project Dependencies

  • PyAudio may need additional system libraries. On Linux, install them with:
    sudo apt-get install sox libsndfile1 ffmpeg portaudio19-dev

After activating the environment, install the required Python packages:

pip install -r requirements.txt

Alternatively, you can run the ASR server in a Docker container.

Docker Setup

To build and run the Docker container for this project, follow these steps:

Build the Docker Image

docker build . -t ws_asr

Run the Docker Container

docker run --gpus all -p 8766:8766 --rm ws_asr

Usage

1. Start the WebSocket Server

Start the ASR server first. It must be running at ws://localhost:8766 before you start the client.

python server.py

2. Run the Client

python client.py

3. Select the Audio Input Device

When the client starts, it lists the available audio input devices. Select a device by entering its ID.
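The device listing can be sketched as follows. This is a minimal illustration, not the project's client.py: the helper names `input_devices` and `query_pyaudio_devices` are hypothetical, and the sketch assumes the client relies on PyAudio's standard device info dictionaries (`index`, `name`, `maxInputChannels`).

```python
from typing import Iterable, List, Tuple


def input_devices(infos: Iterable[dict]) -> List[Tuple[int, str]]:
    """Keep only devices that can record (at least one input channel)."""
    return [
        (int(info["index"]), str(info["name"]))
        for info in infos
        if int(info.get("maxInputChannels", 0)) > 0
    ]


def query_pyaudio_devices() -> List[dict]:
    """Collect PyAudio's device info dictionaries (requires PyAudio)."""
    import pyaudio  # imported lazily so input_devices() works without it

    pa = pyaudio.PyAudio()
    try:
        return [
            pa.get_device_info_by_index(i)
            for i in range(pa.get_device_count())
        ]
    finally:
        pa.terminate()
```

Calling `input_devices(query_pyaudio_devices())` on a machine with PyAudio installed returns (ID, name) pairs for microphones and other inputs, leaving out output-only devices.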

4. Start Speaking

Once the connection is established, the client captures audio and sends it to the WebSocket server for transcription in real time.
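The capture-and-send loop might look roughly like the sketch below. It rests on several assumptions that the project does not confirm: the third-party `websockets` package is used, audio travels as raw 16-bit PCM in binary messages, and the server replies with one text message per frame. The names `pcm_frames` and `stream_audio` are hypothetical.

```python
from typing import Iterator


def pcm_frames(buffer: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Split a PCM byte buffer into fixed-size frames, dropping a short tail."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    for start in range(0, len(buffer) - frame_bytes + 1, frame_bytes):
        yield buffer[start:start + frame_bytes]


async def stream_audio(uri: str, buffer: bytes, frame_bytes: int) -> None:
    """Send each frame to the server and print whatever comes back."""
    import websockets  # third-party; imported lazily

    async with websockets.connect(uri) as ws:
        for frame in pcm_frames(buffer, frame_bytes):
            await ws.send(frame)    # bytes are sent as a binary message
            print(await ws.recv())  # assumed: one reply per frame
```

With a server running, this could be driven by `asyncio.run(stream_audio("ws://localhost:8766", audio_bytes, 5120))`, where 5120 bytes corresponds to 160 ms of 16 kHz mono int16 audio.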

Cache-Aware Streaming FastConformer

This project utilizes NeMo models trained for streaming applications, as described in the paper: Noroozi et al. "Stateful FastConformer with Cache-based Inference for Streaming Automatic Speech Recognition" (accepted to ICASSP 2024).

Model Features

  • Trained with limited left and right-side context to enable low-latency streaming transcription.
  • Implements caching to avoid recomputation of previous activations, reducing latency further.

Available Model Checkpoints

  1. stt_en_fastconformer_hybrid_large_streaming_80ms - 80ms lookahead / 160ms chunk size
  2. stt_en_fastconformer_hybrid_large_streaming_480ms - 480ms lookahead / 560ms chunk size
  3. stt_en_fastconformer_hybrid_large_streaming_1040ms - 1040ms lookahead / 1120ms chunk size
  4. stt_en_fastconformer_hybrid_large_streaming_multi - 0ms, 80ms, 480ms, 1040ms lookahead / 80ms, 160ms, 560ms, 1120ms chunk size

Model Inference Process

  • Audio is continuously recorded in chunks and fed into the ASR model.
  • A PyAudio input stream passes each recorded chunk to a stream_callback function at set intervals.
  • The transcribe function processes the audio chunk and returns the transcription in real time.
  • Chunk size determines the duration of audio processed per step.
  • Lookahead size is calculated as chunk size - 80 ms (since FastConformer models have a fixed 80ms output timestep duration).
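The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: audio is assumed to arrive as native-endian 16-bit PCM, `transcribe` stands in for the real model call, and `make_stream_callback`, `pcm16_to_floats`, and `lookahead_ms` are hypothetical names. The callback signature and return tuple follow PyAudio's callback mode.

```python
from array import array
from typing import Callable, List

ENCODER_STEP_MS = 80  # fixed FastConformer output timestep, per the text


def lookahead_ms(chunk_ms: int) -> int:
    """Lookahead is the chunk size minus one encoder step."""
    return chunk_ms - ENCODER_STEP_MS


def pcm16_to_floats(data: bytes) -> List[float]:
    """Decode native-endian int16 PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(data)
    return [s / 32768.0 for s in samples]


def make_stream_callback(
    transcribe: Callable[[List[float]], str],
    on_text: Callable[[str], None],
):
    """Build a PyAudio-style callback that transcribes every chunk."""

    def stream_callback(in_data, frame_count, time_info, status):
        on_text(transcribe(pcm16_to_floats(in_data)))
        return (in_data, 0)  # 0 is the value of pyaudio.paContinue

    return stream_callback
```

The returned function can be passed as `stream_callback=` when opening a PyAudio input stream. Running the model inside the callback keeps the sketch short; a real client may prefer to hand chunks to a queue so the audio thread is never blocked.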

Configuration

  • WebSocket Server URL: You can change the WebSocket server address in server.py by modifying the WS_URL variable.
  • Audio Parameters: Adjust SAMPLE_RATE, chunk_size, and ENCODER_STEP_LENGTH in client.py to fine-tune the audio streaming behavior.
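To show how these three parameters relate, here is a small sketch. It assumes that chunk_size and ENCODER_STEP_LENGTH are expressed in milliseconds, that audio is 16 kHz mono int16, and that the chunk size must be a whole multiple of the encoder step (which follows from lookahead = chunk size - 80 ms when the lookahead is a whole number of steps). The `StreamConfig` class is hypothetical and not part of client.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    sample_rate: int = 16000       # Hz (assumed default)
    chunk_size: int = 160          # ms of audio per step
    encoder_step_length: int = 80  # ms per encoder output step

    def __post_init__(self) -> None:
        if self.chunk_size <= 0 or self.chunk_size % self.encoder_step_length:
            raise ValueError(
                "chunk_size must be a positive multiple of encoder_step_length"
            )

    @property
    def frames_per_chunk(self) -> int:
        """Number of audio samples in one chunk."""
        return self.sample_rate * self.chunk_size // 1000

    @property
    def bytes_per_chunk(self) -> int:
        """Chunk size in bytes for mono int16 audio (2 bytes per sample)."""
        return self.frames_per_chunk * 2
```

With the defaults, `StreamConfig().frames_per_chunk` is 2560 samples, a natural value for PyAudio's `frames_per_buffer`. A chunk_size of 100 is rejected because it is not a multiple of 80.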

Troubleshooting

  • If NeMo installation fails, refer to the NeMo Installation Guide for specific dependencies and troubleshooting steps.
  • Ensure your WebSocket server is running and accessible at the correct URL.
  • If no audio input devices are found, check your microphone settings and ensure that pyaudio is correctly installed.
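A quick way to check whether anything is listening on the server port is a plain TCP connection test, sketched below with the standard library only. The helper name `port_is_open` is hypothetical. It confirms only that the port accepts connections, not that the WebSocket handshake or the ASR model works.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False
```

For the default setup, `port_is_open("localhost", 8766)` should return True once server.py or the Docker container is running.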

License

This project is open-source. Feel free to modify and improve it!


Notes

  • The WebSocket server handling ASR is expected to be running separately.
