A collection of datasets for Data Visualization, Data Analysis, and Machine Learning tasks
- Anomaly Detection
- Audio Processing
  - Audio Classification
  - Music Identification
  - Speech Recognition
- Computer Vision
  - Face Recognition
  - Image Classification
  - Image Clustering
  - Image Generation
  - Image Segmentation
  - Object Detection
  - Optical Character Recognition
- Graph Data
- Natural Language Processing
  - Machine Translation
  - Sentiment Analysis
  - Text Classification
  - Text Clustering
  - Text Generation
  - Text Summarization
- Tabular Data
- Time Series
Here are some popular resources to download a wide range of datasets for your projects:
Source | Description |
---|---|
Kaggle Datasets | A comprehensive collection of datasets across various domains, including machine learning, computer vision, NLP, and more. |
UCI Machine Learning Repository | A well-known collection of datasets for machine learning tasks. |
Google Dataset Search | A search engine that helps you find datasets stored across the web. |
AWS Public Datasets | Amazon's collection of public datasets, including data related to machine learning, genomics, and more. |
OpenML | An open platform for sharing datasets, machine learning algorithms, and experiments. |
Data.gov | A U.S. government site offering datasets for a wide range of public sectors, including health, agriculture, and energy. |
Zenodo | A general-purpose repository for research datasets, articles, and software, with great support for open data. |
Here are some essential tools and libraries for working with datasets:
Category | Library | Description |
---|---|---|
Data Manipulation & Analysis | Pandas | A powerful library for data manipulation and analysis, providing data structures like DataFrames. |
Data Manipulation & Analysis | NumPy | A fundamental library for numerical computing in Python, supporting large, multi-dimensional arrays and matrices. |
Data Manipulation & Analysis | Dask | A parallel computing library that scales Pandas and NumPy workflows for larger-than-memory datasets. |
Data Visualization | Matplotlib | A popular library for creating static, animated, and interactive visualizations in Python. |
Data Visualization | Seaborn | Built on top of Matplotlib, it provides a high-level interface for drawing attractive statistical graphics. |
Data Visualization | Plotly | An interactive graphing library for Python, useful for creating web-based visualizations. |
Data Visualization | Bokeh | A visualization library for creating interactive plots and dashboards. |
Machine Learning & Data Science | Scikit-learn | A simple and efficient library for machine learning in Python. |
Machine Learning & Data Science | TensorFlow | An open-source framework for building and training machine learning models. |
Machine Learning & Data Science | PyTorch | A deep learning framework offering flexibility and speed. |
Machine Learning & Data Science | XGBoost | A highly efficient gradient boosting library for regression, classification, and ranking tasks. |
Data Preprocessing | OpenCV | A powerful library for computer vision tasks, including image processing and feature extraction. |
Data Preprocessing | Librosa | A Python package for music and audio analysis. |
Data Preprocessing | NLTK | A library for natural language processing with easy-to-use tools for text analysis. |
Data Cleaning & Transformation | Cleanlab | A library for automatically detecting and correcting data errors in real-world datasets. |
Data Cleaning & Transformation | Great Expectations | A framework for data testing, documentation, and profiling. |
Data Cleaning & Transformation | Pyjanitor | A Python library for cleaning and transforming datasets, providing a simple and efficient API. |
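As a quick illustration of how the data manipulation tools above are typically used, here is a minimal sketch of a first look at a tabular dataset with Pandas. The sample data, its column names, and the `summarize` helper are hypothetical and only stand in for a real CSV file from this collection.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a CSV file; not an actual dataset from this repo.
SAMPLE_CSV = """id,feature,label
1,0.5,a
2,,b
3,1.5,a
"""


def summarize(df: pd.DataFrame) -> dict:
    """Return a few basic facts worth checking before analysis or modelling."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # Count of missing values per column.
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
    }


# pd.read_csv accepts a file path as well as a file-like object.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(df)
print(summary)
```

To try this on a real file, replace the `io.StringIO(...)` argument with the path to a CSV file.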
Any mistakes, suggestions, or contributions? Feel free to reach out to me at:
I look forward to connecting with you! 🏃‍♂️
This project is licensed under the MIT License.
The datasets in the ./data/ directory may be subject to their own licenses and usage restrictions, which are specified within each respective folder.
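Since each dataset folder may carry its own terms, it can help to check what is there before reusing the data. The sketch below lists every folder under `./data/` together with any license-like files it finds. The file names in `LICENSE_NAMES` and the `find_license_files` helper are assumptions made for illustration; the actual folders may name or store their license information differently.

```python
from pathlib import Path

# Assumed file names; the folders in ./data/ may use different ones.
LICENSE_NAMES = {"license", "license.txt", "license.md"}


def find_license_files(data_dir):
    """Map each dataset folder under data_dir to the license-like files it contains."""
    result = {}
    for folder in sorted(Path(data_dir).iterdir()):
        if not folder.is_dir():
            continue
        result[folder.name] = sorted(
            p.name
            for p in folder.iterdir()
            if p.is_file() and p.name.lower() in LICENSE_NAMES
        )
    return result


if __name__ == "__main__":
    data_dir = Path("./data")
    if data_dir.is_dir():
        for name, files in find_license_files(data_dir).items():
            status = ", ".join(files) if files else "no license file found"
            print(f"{name}: {status}")
```

A folder reported as having no license file may still state its terms elsewhere, for example in its own README, so treat the output as a starting point rather than a final answer.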