DeepInfant® is a neural network system designed to predict whether and why your baby is crying.
DeepInfant uses artificial intelligence and machine learning algorithms to determine which acoustic features are associated with each of a baby's needs.
For example, babies who are in pain produce cries with high energy, while a fussy cry may have more periods of silence. We are preparing our findings for academic review and publication, and have consolidated them into a single well-trained model built on academic datasets.
DeepInfant is an advanced neural network system specifically engineered for infant cry classification and analysis. The system employs deep learning techniques to help caregivers accurately interpret their baby's needs in real-time. The latest version, DeepInfant V2, achieves 89% accuracy through significant architectural improvements including:
- Enhanced feature extraction using mel-spectrograms and STFT
- Hybrid CNN-LSTM architecture for better temporal modeling
- Expanded classification capabilities covering 5 distinct cry types
- Optimized model efficiency with 3x faster inference time
- Robust training on augmented datasets with over 10,000 labeled cry samples
Our research demonstrates that combining specialized audio processing with deep learning can provide reliable, real-time baby cry analysis through mobile applications, helping both parents and healthcare providers better respond to infant needs. Read our full research paper.
DeepInfant is a machine learning model that predicts your baby's needs by classifying the sound of their cries.
DeepInfant was initially developed as part of a final project in the Speech Technology Course at KTH (Royal Institute of Technology, Sweden). The latest version, DeepInfant V2, has been trained using a combination of private datasets along with published datasets to improve its generalization and accuracy. The training was further enhanced based on the V1 methodologies with additional refinements to optimize performance.
Based on approaches from the babycry repository, data collection involved:
- Collection of short audio segments (~2-7 seconds).
- Labeling of audio clips using crowd-sourced and in-house volunteer annotations.
- Tagging each audio segment with:
  - Presence or absence of cry,
  - Potential need (e.g., hunger, discomfort, pain),
  - Age and gender.
To increase the robustness and variability of the training data, DeepInfant V2 leverages augmentation techniques:
- Pitch Shifting: Modifying the pitch of the audio sample slightly to simulate different baby vocal ranges.
- Time Stretching: Speeding up or slowing down the cry without affecting pitch to capture different crying tempos.
- Adding Noise: Introducing background or white noise to simulate real-world environments (e.g., household hum, slight chatter).
These augmentations help the model generalize better and handle varying acoustic conditions.
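As a rough illustration, the sketch below applies these three augmentations to a raw waveform using librosa and numpy. The specific parameter values (±2 semitones, ±10% tempo, 0.5% noise amplitude) are illustrative choices, not the exact settings used to train DeepInfant V2.

```python
import numpy as np
import librosa

def augment(samples, sample_rate):
    """Return a few augmented variants of a raw cry waveform (illustrative values)."""
    augmented = []

    # Pitch shifting: move the cry up/down by up to two semitones
    for n_steps in (-2, 2):
        augmented.append(librosa.effects.pitch_shift(y=samples, sr=sample_rate, n_steps=n_steps))

    # Time stretching: speed up or slow down without changing pitch
    for rate in (0.9, 1.1):
        augmented.append(librosa.effects.time_stretch(y=samples, rate=rate))

    # Adding noise: low-level white noise to mimic household background sound
    noise = np.random.randn(len(samples)).astype(samples.dtype)
    augmented.append(samples + 0.005 * noise)

    return augmented
```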
This repo ships with pre-trained CoreML models:
- DeepInfant_VGGish
- DeepInfant_AFP
- DeepInfant_V2
Model | Approach | Training Data | Classification Window | Accuracy |
---|---|---|---|---|
DeepInfant_VGGish | Extra layers for classification | Limited dataset | Fixed 975ms | 75% |
DeepInfant_AFP | Optimized for speed | Limited dataset | Flexible | 78% |
DeepInfant_V2 | Advanced feature extraction | Expanded dataset with new cry classes | Flexible & optimized | 89% |
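To experiment with one of these pre-trained models from Python, the coremltools package can load and inspect it. This is only a sketch: the file name and extension, as well as the model's input and output names, depend on how the model was exported, so check the printed spec before calling predict (which itself requires macOS).

```python
import coremltools as ct

# Assumed file name; the released model may ship as .mlmodel or .mlpackage
model = ct.models.MLModel("DeepInfant_V2.mlmodel")

# Print the input/output descriptions so we know what the model expects
print(model.get_spec().description)

# On macOS, predictions can then be made with a dict keyed by the input name, e.g.:
# result = model.predict({"audio_features": preprocessed_spectrogram})  # hypothetical input name
```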
By integrating lessons from the babycry repository, DeepInfant V2 expands upon the initial pipeline:
- Data Preprocessing
  - Trimming or zero-padding audio files to uniform lengths (2–7 seconds).
  - Applying data augmentation (pitch shifting, time stretching, noise addition).
- Feature Extraction
  - Short-Time Fourier Transform (STFT)
  - Mel-spectrogram
  - Log-mel-spectrogram normalization
- Model Architecture (a minimal sketch follows this list)
  - Convolutional Neural Networks (CNN) for spectral feature extraction.
  - Recurrent layers (LSTM) for temporal context modeling.
  - Fully connected layers for final classification.
- Evaluation
  - Cross-validation and hold-out validation sets.
  - Metrics: Accuracy, Precision, Recall, F1-score.
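To make the Model Architecture step concrete, here is a minimal PyTorch sketch of a hybrid CNN-LSTM classifier over log-mel spectrograms. The layer sizes, the bidirectional LSTM, and the default class count (9, matching the class table further below) are illustrative assumptions rather than the published DeepInfant V2 configuration.

```python
import torch
import torch.nn as nn

class CryCNNLSTM(nn.Module):
    """Illustrative hybrid CNN-LSTM for cry classification (not the exact DeepInfant V2 layers)."""

    def __init__(self, n_mels=128, num_classes=9):
        super().__init__()
        # CNN front end: extracts local spectral patterns from the (1, n_mels, time) input
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # LSTM models temporal context across the (downsampled) time frames
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, num_classes)  # fully connected classifier

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        feats = self.cnn(x)                    # (batch, 64, n_mels // 4, time // 4)
        b, c, f, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)  # -> (batch, time', features)
        out, _ = self.lstm(feats)              # (batch, time', 256)
        return self.fc(out[:, -1])             # class logits from the last time step

# Example: a batch of 8 one-channel log-mel spectrograms, 128 mel bands x 220 frames
logits = CryCNNLSTM()(torch.randn(8, 1, 128, 220))   # -> shape (8, 9)
```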
Below is an illustrative table adapted from the babycry approach:
Step | Description | Tools/Libraries |
---|---|---|
Data Labeling | Label audio segments as cry/non-cry and reason. | Crowd-sourced labeling, manual checks |
Augmentation | Pitch shifting, time stretching, adding noise. | `librosa`, custom Python scripts |
Feature Extraction | MFCC or mel-spectrogram generation to represent audio features. | `librosa`, `scipy`, `numpy` |
Classification | CNN, RNN, or hybrid architecture for final cry reason detection. | `pytorch`, `fastai`, `tensorflow`, CoreML |
Evaluation | Evaluate model performance across multiple metrics and splits. | `scikit-learn` metrics, confusion matrix |
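For the Evaluation row, a typical scikit-learn snippet looks like the following. This is an illustrative helper, not the exact DeepInfant evaluation script; `y_true` and `y_pred` are assumed to be the hold-out labels and model predictions.

```python
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             confusion_matrix, classification_report)

def evaluate(y_true, y_pred, class_names):
    """Report accuracy, macro precision/recall/F1, and the confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(f"accuracy={acc:.3f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=class_names, zero_division=0))
```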
Below, we list the classes that the model can predict.
Class | Description | Window Duration | Common Indicators |
---|---|---|---|
belly_pain | Indicates digestive discomfort or colic | 0.975s | High-pitched, intense crying with legs drawing up |
burping | Need to release trapped air | 0.975s | Short, choppy cries with grunting sounds |
cold_hot | Temperature-related discomfort | 0.975s | Whiny, continuous cry with temperature changes |
discomfort | General unease (diaper, position) | 0.975s | Intermittent crying that changes with position |
hungry | Need for feeding | 0.975s | Rhythmic, repetitive cry pattern with rooting |
lonely | Seeking attention or comfort | 0.975s | Low-intensity cry that stops with holding |
scared | Response to sudden changes/fear | 0.975s | Sharp, sudden crying with startle response |
tired | Need for sleep | 0.975s | Grumbling cry with eye rubbing |
unknown | Unclassified crying pattern | 0.975s | Variable patterns not matching other categories |
DeepInfant V2 is trained using a deep neural network with multiple stages of processing:
The audio clips have a sample rate of 16,000 Hz and a duration of roughly 7 seconds, so each clip contains about 16,000 × 7 ≈ 112,000 samples. We take a fast Fourier transform (FFT) of a 2048-sample window, slide it forward by 512 samples, and repeat the process over the 7-second clip. The resulting representation can be shown as a 2D image and is called a Short-Time Fourier Transform (STFT). Since humans perceive sound on a logarithmic scale, we convert the STFT to the mel scale. The librosa library lets us load an audio file and convert it to a mel-spectrogram:
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

fname = 'test-1-audio001.wav'
samples, sample_rate = librosa.load(fname, sr=16000)  # resample to the 16 kHz rate used above

# Set up a small figure with no axes so only the spectrogram image is drawn
fig = plt.figure(figsize=[4, 4])
ax = fig.add_subplot(111)
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)
ax.set_frame_on(False)

# Compute the mel-spectrogram and display it on a decibel (log-power) scale
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
The mel-spectrogram of a baby crying looks like the image below:
Following the babycry approach, we often apply augmentations before or after extracting the mel-spectrogram. For instance, pitch shifting is done on the raw waveform, whereas time stretching can be applied at the waveform level or sometimes in the frequency domain.
In order to build the spectrograms of the audio samples needed for training the model, we'll be using the fantastic audio loader module for fastai v1 built by Jason Hartquist.
n_fft = 2048        # FFT window size (about 1024 frequency bins per frame)
n_hop = 512         # hop of 512 samples -> 75% overlap between 2048-sample frames
n_mels = 128        # compress the frequency bins to 128 mel bands
sample_rate = 16000

tfms = get_frequency_batch_transforms(n_fft=n_fft, n_hop=n_hop,
                                      n_mels=n_mels, sample_rate=sample_rate)
batch_size = 64
data = (AudioItemList.from_folder(CRYING_PATH)
        .split_by_folder()
        .label_from_folder()
        .databunch(bs=batch_size, tfms=tfms, equal_lengths=False))

learn = create_cnn(data, models.resnet34, metrics=accuracy)
learn.lr_find(start_lr=0.001, end_lr=1)
learn.recorder.plot()
Fastai's cyclical learning rate finder runs the model over a small number of training batches while gradually increasing the learning rate, to find a good value.
As the learning rate increases toward 10⁻² (1e-2), the model loss decreases; for higher learning rates, the loss begins to increase. Hence we pick 1e-2 as the learning rate for training the model.
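With that learning rate, training in fastai v1 is a single call; the epoch count here is an illustrative choice, not the exact schedule used for the released models.

```python
learn.fit_one_cycle(10, max_lr=1e-2)   # one-cycle training at the learning rate chosen above
```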
After training the model over a few epochs, we see an accuracy of 95% over the validation set:
Now that we have a well-performing model, using it in a real application requires running predictions over an audio stream in real time.
We use the pyaudio library to read audio samples from the device microphone, convert the audio data into numpy arrays, and feed them to the model.
import numpy as np
import pyaudio

RATE, CHUNKSIZE, RECORD_SECONDS = 16000, 1024, 7   # 16 kHz input, 7-second clips; chunk size is arbitrary

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNKSIZE)

while True:
    frames = []  # a Python list of numpy.ndarray chunks
    for _ in range(0, int(RATE / CHUNKSIZE * RECORD_SECONDS)):
        data = stream.read(CHUNKSIZE, exception_on_overflow=False)
        frames.append(np.frombuffer(data, dtype=np.float32))
    npdata = np.hstack(frames)
    audio_clip = AudioClip.from_np(npdata, RATE)
    run_pred(audio_clip)
The code above reads a 7-second audio clip from the microphone, converts it into a numpy array, and runs the model on it to get a prediction. This simple piece of code is now ready to be deployed to a service or an embedded device and used in real applications!
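The `run_pred` helper is not defined above; a minimal version, assuming the trained fastai learner from the previous section and that `learn.predict` accepts the audio item directly, might look like:

```python
def run_pred(audio_clip):
    # Assumes `learn` is the trained fastai v1 learner and that the audio item
    # type used here is compatible with learn.predict (an assumption of this sketch).
    pred_class, pred_idx, probs = learn.predict(audio_clip)
    print(f"Predicted: {pred_class} (confidence {probs[pred_idx].item():.2f})")
```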
The audio files should contain baby cry samples, with the corresponding tagging information encoded in the filenames. The samples were tagged by the contributors themselves. Here is how to parse the filenames (a small parsing sketch follows the code legend below).
0D1AD73E-4C5E-45F3-85C4-9A3CB71E8856-1430742197-1.0-m-04-hu.caf
- App instance UUID (36 chars): 0D1AD73E-4C5E-45F3-85C4-9A3CB71E8856
- Unix epoch timestamp: 1430742197 (Mon, 04 May 2015 12:23:17 GMT)
- App version: 1.0
- Gender: m (male)
- Age: 04 (0-4 weeks)
- Reason: hu (hunger)
0c8f14a9-6999-485b-97a2-913c1cbf099c-1431028888092-1.7-m-26-sc.3gp
- App instance UUID (36 chars): 0c8f14a9-6999-485b-97a2-913c1cbf099c
- Unix epoch timestamp: 1431028888092 (milliseconds)
- App version: 1.7
- Gender: m (male)
- Age: 26 (2 to 6 months old)
- Reason: sc (scared)
Gender codes:
- m - male
- f - female

Age codes:
- 04 - 0 to 4 weeks old
- 48 - 4 to 8 weeks old
- 26 - 2 to 6 months old
- 72 - 7 months to 2 years old
- 22 - more than 2 years old

Reason codes:
- hu - hungry
- bu - needs burping
- bp - belly pain
- dc - discomfort
- ti - tired
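Putting the legend above together, a minimal filename parser might look like the sketch below. It assumes the field order shown in the examples; reason codes outside the legend are returned as-is.

```python
from pathlib import Path

REASONS = {"hu": "hungry", "bu": "burping", "bp": "belly_pain",
           "dc": "discomfort", "ti": "tired"}

def parse_cry_filename(fname):
    """Split a tagged cry filename into its fields (sketch, field order as documented above)."""
    stem = Path(fname).stem            # drop the .caf / .3gp extension
    uuid = stem[:36]                   # app instance UUID is always 36 characters
    timestamp, version, gender, age, reason = stem[37:].split("-")
    return {
        "uuid": uuid,
        "timestamp": timestamp,
        "app_version": version,
        "gender": gender,              # m / f
        "age_code": age,               # e.g. 04 = 0 to 4 weeks old
        "reason": REASONS.get(reason, reason),
    }

print(parse_cry_filename("0D1AD73E-4C5E-45F3-85C4-9A3CB71E8856-1430742197-1.0-m-04-hu.caf"))
```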
This repo contains an example of using the DeepInfant_VGGish, DeepInfant_AFP, or DeepInfant_V2 models to build an iOS app that analyzes a baby's cry sound and pushes prediction results with a tip on how to deal with each predicted result.
If you use DeepInfant in your research, please cite the following paper:
@article{DeepInfant,
title={DeepInfant: A Deep Learning Model for Infant Cry Classification and Analysis},
author={Skytells AI Research},
year={2025}
}
For further assistance or inquiries, please reach out to the Skytells AI Research team at research@skytells.ai or visit our website at https://my.skytells.io.
DeepInfant is licensed under the Apache License 2.0.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Copyright 2025 Skytells AI Research
DeepInfant V2 is built on proprietary datasets along with publicly available datasets, ensuring robustness and real-world effectiveness in baby cry analysis.