
Infant Audio Classification: Cry, Scream, and Normal Utterances

Project Overview

This project develops a robust ensemble-based audio classification system that distinguishes between infant cries, screams, and normal utterances. The approach integrates two models, YAMNet and Wav2Vec2, combining broad pre-trained audio knowledge with task-specific fine-tuning.

The system processes raw audio input, extracts relevant features, and classifies the sounds into three categories:

  • Crying
  • Screaming
  • Normal Utterances

An ensemble technique combines the predictions from YAMNet and Wav2Vec2 to improve accuracy and robustness.


Features

  • Multi-dataset integration for diverse and balanced training.
  • Preprocessing techniques including noise reduction and pitch normalization.
  • Fine-tuned YAMNet and Wav2Vec2 models.
  • Ensemble learning to improve classification accuracy.
  • Google Colab integration for ease of execution.
  • Performance evaluation through accuracy, precision, recall, and confusion matrices.

Theory and Model Explanation

1. YAMNet

YAMNet is a deep neural network trained on AudioSet that predicts 521 audio event classes. It takes log-mel spectrograms as input features and uses the MobileNetV1 depthwise-separable convolution architecture.

Modifications in this project (see the sketch after this list):

  • Fine-tuned the model on our specific dataset.
  • Adjusted output layers to classify three specific categories.
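
The sketch below shows one minimal way to set this up, assuming TensorFlow and tensorflow_hub are installed; the head architecture (a single 256-unit hidden layer) and training configuration are illustrative assumptions, not the project's exact code.

import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained YAMNet model from TensorFlow Hub.
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

def embed(waveform):
    # YAMNet expects a mono 16 kHz float32 waveform and returns
    # (class scores, 1024-d frame embeddings, log-mel spectrogram);
    # we mean-pool the frame embeddings into one clip-level vector.
    _, embeddings, _ = yamnet(waveform)
    return tf.reduce_mean(embeddings, axis=0)

# Small classification head over the pooled embedding for the three
# target classes: crying, screaming, normal utterances.
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])
head.compile(optimizer='adam',
             loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])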

2. Wav2Vec2

Wav2Vec2 is a self-supervised speech representation learning model by Facebook AI. Unlike YAMNet, it learns contextualized representations directly from raw audio signals, making it robust to noise and variations.

Modifications in this project (see the sketch after this list):

  • Fine-tuned on labeled infant audio data.
  • Adjusted classification head for cry, scream, and normal utterance detection.
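
A minimal sketch of this setup using the Hugging Face transformers API; the facebook/wav2vec2-base checkpoint and the helper function below are illustrative assumptions, not the project's exact code.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Classification variant of Wav2Vec2 with a freshly initialized
# 3-way head, to be fine-tuned on the labeled infant audio data.
extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base')
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    'facebook/wav2vec2-base', num_labels=3)

def predict_proba(waveform, sample_rate=16000):
    # waveform: 1-D float array of raw audio sampled at 16 kHz.
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0)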

3. Ensemble Model

We implemented an ensemble approach by combining outputs from both models using:

  • Averaging Probabilities: Taking the mean of both models' predictions.
  • Majority Voting: Selecting the most frequent predicted class.

Combining the two models reduces individual-model errors and improves overall accuracy; both strategies are sketched below.
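
A minimal sketch of the two strategies, assuming each model outputs a per-class probability vector; function and variable names are illustrative:

import numpy as np

def average_probabilities(p_yamnet, p_wav2vec2):
    # Mean of the two models' per-class probabilities, then argmax.
    mean_p = (np.asarray(p_yamnet) + np.asarray(p_wav2vec2)) / 2
    return int(np.argmax(mean_p))

def majority_vote(class_predictions):
    # Most frequent predicted class label; with only two models a tie
    # is possible, in which case averaging can serve as a tie-breaker.
    values, counts = np.unique(class_predictions, return_counts=True)
    return int(values[np.argmax(counts)])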


Dataset

The project utilizes multiple datasets:

  • Infant Cry Audio Corpus (Kaggle)
  • Human Screaming Detection Dataset (Kaggle)
  • Children's speech clips from AudioSet

All datasets were preprocessed (see the sketch after this list) to ensure:

  • Consistent sample rates
  • Uniform bit-depth normalization
  • Proper segmentation and labeling
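
A minimal preprocessing sketch, assuming librosa is available; the 16 kHz target rate and 5-second segment length are illustrative assumptions, not the project's exact settings:

import librosa

def preprocess(path, target_sr=16000, segment_seconds=5.0):
    # Load as mono float32 at a consistent sample rate; decoding to
    # float also normalizes away differences in source bit depth.
    audio, _ = librosa.load(path, sr=target_sr, mono=True)
    # Split into fixed-length segments for labeling and training.
    hop = int(segment_seconds * target_sr)
    segments = [audio[i:i + hop] for i in range(0, len(audio), hop)]
    return [s for s in segments if len(s) == hop]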

Setup Instructions

1. Google Colab Setup

This project is implemented in Google Colab for ease of execution.

Steps to Run on Colab:

  1. Open Google Colab: Colab Link

  2. Upload the dataset to Google Drive.

  3. Mount Google Drive in Colab:

    from google.colab import drive
    drive.mount('/content/drive')
  4. Clone the GitHub repository:

    !git clone https://github.com/Arjun-08/Developing-an-Ensemble-Model-for-Detecting-Infant-Cries-Screams-and-Normal-Utterances.git
    %cd Developing-an-Ensemble-Model-for-Detecting-Infant-Cries-Screams-and-Normal-Utterances
  5. Install required dependencies:

    !pip install -r requirements.txt
  6. Run the training and inference scripts (detailed below).


Training, Testing, and Validation

Dataset Split (see the sketch after this list):

  • 70% Training
  • 15% Validation
  • 15% Testing
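
One way to realize this split with scikit-learn, using stratification to preserve class balance; X and y below are random placeholders standing in for the preprocessed segments and labels:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real features and labels.
X = np.random.randn(100, 16000)
y = np.random.randint(0, 3, size=100)

# Hold out 30%, then split it evenly into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)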

Evaluation Metrics (computed as sketched after this list):

  • Accuracy
  • Precision, Recall, F1-score
  • Confusion Matrices
  • ROC Curves
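
These metrics can be computed with scikit-learn, as in the sketch below; y_true and y_pred are placeholders, and the weighted average is an assumption about how the multi-class scores are aggregated:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 1, 2, 1, 0]  # placeholder ground-truth labels
y_pred = [0, 1, 1, 1, 0]  # placeholder model predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted')
print(f"Accuracy {acc:.4f}  Precision {prec:.4f}  Recall {rec:.4f}  F1 {f1:.4f}")
print(confusion_matrix(y_true, y_pred))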

Inference Instructions

Once the model is trained, you can perform inference on new audio files.

Run Inference on a Test Audio File:

# run_inference is provided by this repository's model module; it loads
# the trained ensemble and classifies a single audio file.
from model import run_inference

prediction = run_inference('/content/drive/MyDrive/frontera/extracted_data/Screaming/---1_cCGK4M_out.wav')
print("Predicted Label:", prediction)

Example Prediction Output:

Predicted Label: [3] (crying)

Results

YAMNet Model Performance:

Epoch 1/10
accuracy: 0.6756 - loss: 0.9839 - val_accuracy: 0.7826 - val_loss: 0.9807
...
Epoch 10/10
accuracy: 0.8676 - loss: 0.4984 - val_accuracy: 0.7826 - val_loss: 0.7915

Wav2Vec2 Model Performance:

Epoch 1   Validation Loss: 0.815121
Epoch 2   Validation Loss: 0.833583
Epoch 3   Validation Loss: 0.837320

Ensemble Model Performance:

Train Loss: 0.6879
Test Accuracy: 0.7681
Test Precision: 0.5900
Test Recall: 0.7681
Test F1 Score: 0.6674


Contact

For questions or collaborations, reach out via nvarjunmani07@gmail.com.


Acknowledgments

Special thanks to Team FRONTERA HEALTH and the dataset providers who made this research possible!