Data Science take-home assignment used in Datung.IO's recruitment process.
Welcome to Datung.IO's take-home assignment repository, part of our data science recruitment process. The following assignment will have you extract, explore, and analyze audio data from English-speaking males and females, and build learning models that predict a given speaker's gender from vocal features such as mean frequency, spectral entropy, or mode frequency.
The raw data consists, as of 30th August 2018, of 95,481 audio samples of male and female speakers speaking short English sentences. The raw data is compressed into `.tgz` files. Each `.tgz` archive contains the following directory structure and files:
```
<file>/
    etc/
        GPL_license.txt
        HDMan_log
        HVite_log
        Julius_log
        PROMPTS
        prompts-original
        README
    LICENSE
    wav/
        - 10 unique .wav audio files
```
The total size of the raw dataset is approximately 12.5 GB once it has been uncompressed. The file format is `.wav` with a sampling rate of 16 kHz and a bit depth of 16 bits. The raw dataset can be found here.
We recommend considering the following for your data pre-processing:
- Automate the raw data download using web scraping techniques, including the extraction of individual audio files (see the first sketch after this list)
- Pre-process the data using audio signal processing packages such as warbleR, tuneR, and seewave for R; librosa and pyAudioAnalysis for Python; or similar packages for other programming languages
- Consider, in particular, the human vocal range, which typically resides within 0-280 Hz
- To help you identify potentially interesting features, consider the following (non-exhaustive) list, several of which are computed in the second sketch after this list:
- Mean frequency (in kHz)
- Standard deviation of frequency
- Median frequency (in kHz)
- Mode frequency
- Peak frequency
- First quartile (in kHz)
- Third quartile (in kHz)
- Inter-quartile range (in kHz)
- Skewness
- Kurtosis
- Make sure to check out all of the files in the raw data; you might find valuable data in files beyond the audio ones
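To make the download recommendation concrete, here is a minimal sketch of scraping and extracting the archives. It assumes a plain HTML index page listing the `.tgz` files; `INDEX_URL` is a placeholder, not the real dataset location:

```python
import os
import tarfile
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder: substitute the dataset index page referenced by the assignment.
INDEX_URL = "https://example.com/audio-dataset/"

def download_archives(dest="data/raw"):
    os.makedirs(dest, exist_ok=True)
    page = requests.get(INDEX_URL, timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href", "")
        if not href.endswith(".tgz"):
            continue
        path = os.path.join(dest, os.path.basename(href))
        with requests.get(urljoin(INDEX_URL, href), stream=True, timeout=30) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(dest)   # pulls out the etc/ metadata and wav/ audio
        os.remove(path)            # keep disk usage down; the full set is ~12.5 GB
```

And a sketch of computing several of the listed statistics for one recording with librosa and SciPy. The band limit and the treatment of the magnitude spectrum as a probability distribution are illustrative choices, not a prescribed method:

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def spectral_stats(wav_path):
    """Frequency-domain summary statistics for a single recording."""
    y, sr = librosa.load(wav_path, sr=16000)            # files are 16 kHz
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr) / 1000  # in kHz
    mask = freqs <= 0.280                               # keep the vocal band
    freqs, spectrum = freqs[mask], spectrum[mask]
    p = spectrum / spectrum.sum()                       # spectrum as a distribution
    mean = float(np.sum(freqs * p))
    sd = float(np.sqrt(np.sum(p * (freqs - mean) ** 2)))
    cdf = np.cumsum(p)
    q25, median, q75 = np.interp([0.25, 0.5, 0.75], cdf, freqs)
    return {
        "meanfreq": mean, "sd": sd, "median": median,
        "Q25": q25, "Q75": q75, "IQR": q75 - q25,
        "peakfreq": freqs[np.argmax(spectrum)],         # frequency of max energy
        "skew": skew(spectrum), "kurt": kurtosis(spectrum),
    }
```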
The following are reference points that should be taken into account in the submission. Please use them to guide the reasoning behind the feature extraction, exploration, analysis, and model building, rather than answering them point-blank.
- How did you go about extracting features from the raw data?
- Which features do you believe contain relevant information?
- How did you decide which features matter most?
- Do any features contain similar information content?
- Are there any insights about the features that you didn't expect? If so, what are they?
- Are there any other (potential) issues with the features you've chosen? If so, what are they?
- Which goodness-of-fit metrics have you chosen, and what do they tell you about the models' performance?
- Which model performs best?
- How would you decide between using a more sophisticated model versus a less complicated one?
- What kind of benefits do you think your model(s) could have as part of an enterprise application or service?
You have 7 days to complete the assignment from the time that you have received the email containing the link to this GitHub repository.
Note: Your submission will be judged with this timeframe in mind, and we do not expect the equivalent of a month's worth of work.
Your submission should be a written presentation, in HTML or PDF format, that clearly and succinctly walks us through your approach to extracting features, exploring them, uncovering any potential constraints or issues with the data in its provided form, your choice of predictive models, and your analysis of the models' performance. Try to keep it concise. Please send your presentation (see Requirements below) to charan [at] datung.io as an attachment, or provide us with a link to where you've uploaded your work.
Happy coding!
- Can I use <Insert SDK or Framework here> for the take-home assignment?
Answer: Yes, you are free to use the tools you are most comfortable working with. We work with a mix of frameworks and try to use the one that best fits the task at hand.
- Where do I send my assignment upon completion?
Answer: You should have received an email with instructions about this take-home assignment that led you to this repo. If you have been invited to the take-home assignment but haven't received an email, please email us at charan [at] datung.io.
- The raw data is too large to fit in memory, what do I do?
Answer: This is part of the challenge, and the dataset is by design larger than fits in memory on a normal computer. You will have to come up with a solution that processes the data in a batch-like or streaming fashion to extract meaningful features (a minimal sketch follows).
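One hedged sketch of such a streaming approach, assuming a per-file feature extractor like the `spectral_stats` function sketched earlier: each recording is loaded, summarized, and written out one at a time, so only a single audio file is ever held in memory.

```python
import csv
import glob

def build_feature_table(wav_glob, out_csv, extract):
    """Stream over recordings one at a time; the corpus never sits in memory."""
    writer = None
    with open(out_csv, "w", newline="") as f:
        for path in glob.iglob(wav_glob, recursive=True):
            row = extract(path)          # e.g. spectral_stats from the sketch above
            if writer is None:           # write the header once, from the first row
                writer = csv.DictWriter(f, fieldnames=["file", *row.keys()])
                writer.writeheader()
            writer.writerow({"file": path, **row})

# Hypothetical layout: build_feature_table("data/**/wav/*.wav", "features.csv", spectral_stats)
```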
- Where do I send my presentation of my results?
Answer: Please send it to charan [at] datung.io. If you've uploaded your work somewhere else, please provide a link that allows us to view it for evaluation.
- Audio classification and feature extraction using librosa and PyTorch: https://medium.com/@hasithsura/audio-classification-d37a82d6715
- Audio analysis and feature extraction using librosa: https://athina-b.medium.com/audio-signal-feature-extraction-for-analysis-507861717dc1
- A comprehensive guide to audio processing with librosa in Python: https://medium.com/@rijuldahiya/a-comprehensive-guide-to-audio-processing-with-librosa-in-python-a49276387a4b
- Audio preprocessing using librosa and pandas (video): https://www.youtube.com/watch?v=ZqpSb5p1xQo
1. Clone the repository with `git clone https://github.com/08Aristodemus24/datung-machine-problem`
2. Navigate to the directory containing the `readme.md` and `requirements.txt` files
3. Run `conda create -n <name of env, e.g. datung-machine-problem> python=3.12.3`. Note that the Python version must be 3.12.3; the packages to be installed are not compatible with other Python versions
4. Once the environment is created, activate it by running `conda activate datung-machine-problem`
5. Check whether pip is installed by running `conda list -e` and inspecting the list; if it is there, move on to step 6, and if not, install it with `conda install pip`
6. Once pip is available, run `pip install -r requirements.txt` in the directory you are currently in
7. With the environment activated, run `python download_data.py` to download the `.tar` files
8. After all downloads have finished and the script is done, run `python extract_data.py` to uncompress the `.tar` files and delete the unnecessary archives after extraction. This results in a `data` folder containing each subject's folder with their respective audio recordings
9. Run the `prepare_data` notebook, which creates the files containing the features extracted from the loaded raw audio signals. This results in an `_EXTRACTED_DATA` folder inside the `data` folder, with two subdirectories, `test` and `train`, containing the features for testing and training the model. Each subdirectory contains two files per subject: `<name of subject>_features.csv` and `<name of subject>_labels.csv`, which are loaded concurrently when the training script is run
10. To train a model, run `python tuning_dl.py -m <name of model, e.g. lstm; available models are listed in the dictionary object in the script> -c <training configuration: deep for a deep learning model, trad for a traditional ML model> -lr <learning rate, e.g. 1e-3> --batch_size <batch size, e.g. 256> --mode <mode to run the script in, e.g. training> --hyper_param_list <a list of hyperparameters separated by spaces, each written as the hyperparameter name followed by an underscore and its value, e.g. hertz_8000 window_time_0.25 hop_time_0.125 n_a_128 dense_drop_prob_0.2 rnn_drop_prob_0.2>`
Some preset commands you can use right away:

`python tuning_dl.py -m lstm -c deep -lr 1e-3 --batch_size 256 --mode training --hyper_param_list hertz_8000 window_time_0.25 hop_time_0.125 n_a_128 dense_drop_prob_0.2 rnn_drop_prob_0.2` to train a sequential LSTM neural network model

`python tuning_dl.py -m softmax -c trad -lr 1e-3 --batch_size 256 --mode training --hyper_param_list hertz_8000 window_time_0.25 hop_time_0.125` to train a softmax regression model, which will use the engineered features
- You can run the `visualization.ipynb` notebook to see the performance metric values of the trained models
- Still to be added: a testing/evaluation script that uses the saved model weights to report the model's overall test performance
- There seems to be something wrong: the models' loss goes up during training and AUC goes down, which signifies the model may be underfitting, or not even fitting the training set well. This may be because there aren't enough examples of the other class for the model to learn from, so it learns that the audio signals are mostly male voices, which we know outnumber the female-labeled recordings in the dataset by a large margin. A solution could be to gather more female recordings and extract more features from them. Another, less viable option is to undersample the male class so that it equals the number of female audio signal inputs (see the sketch below).
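A minimal sketch of that undersampling option, assuming the per-subject features and labels have been loaded into a single pandas DataFrame with a binary label column (the column name is illustrative):

```python
import pandas as pd

def undersample_majority(df: pd.DataFrame, label_col: str = "gender", seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    minority_n = df[label_col].value_counts().min()
    parts = [
        group.sample(n=minority_n, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so male/female rows are interleaved rather than blocked.
    return pd.concat(parts).sample(frac=1, random_state=seed)
```

An alternative that discards no data is weighting the loss instead, e.g. passing `class_weight={0: w0, 1: w1}` to Keras's `model.fit`.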
- Hyperparameter tuning to determine more viable hyperparameters for each model
- Learn and try TensorFlow Decision Forests models and see if they do better than a typical softmax regression model
- Learn more about audio signal processing, as I still don't know how to better extract features from audio signals without fully understanding concepts like mel spectrograms, spectral centroids, etc. (see the sketch after this list)
- Solve why the F1 score seems to be a NumPy array instead of a single value: https://stackoverflow.com/questions/68596302/f1-score-metric-per-class-in-tensorflow
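For the mel spectrogram and spectral centroid concepts mentioned above, a minimal librosa sketch (the file name is a placeholder):

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)

# Mel spectrogram: energy per mel-spaced frequency band over time,
# usually log-scaled before being fed to a model.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)

# Spectral centroid: the "center of mass" of the spectrum per frame;
# higher values correspond roughly to "brighter" sounds.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

print(log_mel.shape, centroid.shape)  # (n_mels, frames), (1, frames)
```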
Training also ran into an out-of-memory error:

```
RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[128,128] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
2025-03-12 16:17:33.380804: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[256,128] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
```

This error may be due to the immense size of the input data, which is (m, 2000, 1), given that we have 6,815 subjects; that is incomparable to my previous project, which had at most 43 subjects. Preprocessing the data for deep learning tasks may require a better machine, or somehow resampling the raw audio signals to a much lower frequency, which may unfortunately cause important features to be lost (see the sketch below).
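A minimal sketch of that resampling idea with librosa, assuming the model inputs are built from the raw waveforms (the file name is a placeholder):

```python
import librosa

y, sr = librosa.load("sample.wav", sr=16000)
# Resampling 16 kHz -> 8 kHz halves the sequence length fed to the model,
# at the cost of discarding content above 4 kHz (the new Nyquist frequency).
y_8k = librosa.resample(y, orig_sr=sr, target_sr=8000)
```

Since the OOM tensors in the log are batch-sized ([128,128] and [256,128]), reducing `--batch_size` is another quick mitigation.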