A machine learning project comparing the effectiveness of Random Forest, K-Nearest Neighbours (KNN), and Support Vector Machine (SVM) models for music genre classification. Through systematic feature selection and model optimisation, the KNN classifier achieved 91% accuracy on the GTZAN dataset.
- Initial model training with the full feature set
- Feature importance analysis using RandomForestClassifier
- Iterative feature selection based on importance rankings:
  - Top 3 features: chroma_stft_mean, spec_bandwidth_mean, rolloff_mean
  - Additional significant features: mfcc1_mean, mfcc2_mean, mfcc3_mean
- Model retraining with optimised feature subset
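The importance-ranking step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses synthetic data from `make_classification` as a stand-in for the GTZAN feature table, and the column names, the `rank_features` helper, and the cut-off of six features are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature table (assumption: the notebook would
# load features_3_sec.csv or features_30_sec.csv here instead).
X_arr, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]  # hypothetical names
X = pd.DataFrame(X_arr, columns=feature_names)


def rank_features(X, y, n_estimators=200, random_state=0):
    """Return impurity-based importances, sorted from highest to lowest."""
    forest = RandomForestClassifier(
        n_estimators=n_estimators, random_state=random_state
    )
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


importances = rank_features(X, y)

# Keep the six highest-ranked features (illustrative cut-off) for retraining.
top_features = importances.head(6).index.tolist()
X_reduced = X[top_features]
print(importances.head(6))
```

Impurity-based importances can favour high-cardinality or correlated features, so a reduced subset chosen this way is worth confirming by retraining and comparing accuracy, as the last step in the list describes.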
- K-Nearest Neighbours (KNN)
  - Best performing model: 91% accuracy
  - Parameters optimised with RandomizedSearchCV
  - Robust performance across genres
- Random Forest
  - Used for initial feature importance analysis
  - Secondary classification model
- Support Vector Machine (SVM)
  - Comparative baseline model
  - Performance evaluation with different kernels
  - Computational efficiency considerations
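As a rough illustration of the KNN tuning step, the sketch below runs a randomised hyperparameter search with scikit-learn's `RandomizedSearchCV`. The synthetic data, the `StandardScaler` step, the search ranges, and the number of iterations are all assumptions made for the example; the notebook's actual search space may differ.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the reduced feature set (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# KNN is distance-based, so features are scaled first (assumed preprocessing).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical search space; parameter names are prefixed with the step name.
param_distributions = {
    "knn__n_neighbors": randint(1, 30),
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# Score the refitted best pipeline on the held-out split.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Placing the scaler inside the pipeline means it is refitted on each training fold, which keeps information from the validation fold out of the preprocessing.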
- KNN achieved the highest accuracy (91%) with the optimised feature set
- Reducing the original feature set to the top-ranked features maintained accuracy
- Cross-validation scores demonstrate model stability
- A detailed confusion matrix highlights per-genre performance
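The cross-validation and confusion-matrix checks mentioned above could be reproduced along these lines. This is a sketch under stated assumptions: the data is synthetic, the three classes stand in for genres, and `n_neighbors=5` is a placeholder rather than the tuned value from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    cross_val_score,
)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the genre features and labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

model = KNeighborsClassifier(n_neighbors=5)  # placeholder hyperparameter
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Per-fold accuracy: a small spread across folds suggests a stable model.
scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Out-of-fold predictions give one prediction per sample for the matrix.
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)

# Rows are true classes, so diagonal / row sum is per-class recall.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(np.round(per_class_recall, 3))
```

Off-diagonal cells show which classes are confused with each other, which is the information needed to decide where genre-pair-specific tuning would help.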
.
├── features_3_sec.csv # Feature set with 3-second windows
├── features_30_sec.csv # Feature set with 30-second windows
├── music_genre_classification.ipynb # Implementation and analysis
├── LICENCE
└── README.md
- Python 3.x
- scikit-learn
- pandas
- numpy
- seaborn (visualisation)
- matplotlib (visualisation)
- Clone the repository:
git clone https://github.com/lukasz-iskierka/ml-music-classification.git
- Install dependencies:
pip install scikit-learn pandas numpy seaborn matplotlib
- Run the Jupyter notebook for detailed analysis and results:
jupyter notebook music_genre_classification.ipynb
- Ensemble method exploration
- Additional feature engineering
- Model optimisation for specific genre pairs
- Performance optimisation for larger datasets
See the LICENCE file for details.
For questions or suggestions, please open an issue in the GitHub repository.