Machine Learning Cheatsheet

Welcome to the in-depth Machine Learning Cheatsheet! This resource is designed to provide both foundational knowledge and advanced insights into machine learning concepts, techniques, and best practices.

Table of Contents

  1. Foundations of Machine Learning
  2. Data Preprocessing
  3. Feature Engineering
  4. Machine Learning Algorithms
  5. Model Evaluation and Validation
  6. Hyperparameter Tuning
  7. Ensemble Methods
  8. Dimensionality Reduction
  9. Handling Imbalanced Data
  10. Interpretable Machine Learning
  11. Deployment and Production
  12. Advanced Topics
  13. Best Practices and Tips

Foundations of Machine Learning

Types of Machine Learning

  • Supervised Learning: Learning from labeled data
    • Classification: Predicting discrete classes
    • Regression: Predicting continuous values
  • Unsupervised Learning: Finding patterns in unlabeled data
    • Clustering: Grouping similar instances
    • Dimensionality Reduction: Reducing feature space
  • Reinforcement Learning: Learning through interaction with an environment
    • Model-based vs Model-free approaches
    • Policy Gradient methods vs Value-based methods

Key Concepts

  • Bias-Variance Tradeoff
    • High Bias: Underfitting, oversimplified model
    • High Variance: Overfitting, model too complex
    • Goal: minimize total error by balancing the two (reducing one typically increases the other)
  • Generalization
    • Training error vs Generalization error
    • Regularization techniques to improve generalization
  • Cross-Validation
    • k-fold cross-validation
    • Stratified k-fold for imbalanced datasets
    • Leave-one-out cross-validation for small datasets

Tip: Use the bias-variance tradeoff to guide your model selection and tuning. Start with a simple model and gradually increase complexity while monitoring both training and validation performance.
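
A minimal sketch of the cross-validation techniques above, using scikit-learn on an illustrative synthetic dataset (all names and values here are for demonstration only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# Stratified k-fold preserves class proportions in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```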

Data Preprocessing

Data Cleaning

  • Handling missing values
    • Deletion: Simple but can lead to data loss
    • Imputation: Mean, median, mode, or advanced methods (KNN, regression)
    • Using algorithms that handle missing values (e.g., XGBoost)
  • Outlier detection and treatment
    • Statistical methods: Z-score, IQR
    • Machine learning methods: Isolation Forests, Local Outlier Factor
    • Domain-specific rules
  • Handling duplicate data
    • Exact duplicates vs near-duplicates
    • Record linkage techniques for identifying similar entries
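
A sketch of the first two bullets above, mean imputation and an IQR-based outlier check, with pandas and scikit-learn (the toy DataFrame and column names are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 7.0, 4.0, 120.0],
                   "b": [2.0, 3.0, np.nan, 5.0, 6.0]})

# Mean imputation: fast, but shrinks variance
# (sklearn.impute.KNNImputer is a drop-in alternative that fills
# gaps from the k most similar rows)
df[["a", "b"]] = SimpleImputer(strategy="mean").fit_transform(df[["a", "b"]])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["a"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["a"] < q1 - 1.5 * iqr) | (df["a"] > q3 + 1.5 * iqr)]
```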

Data Transformation

  • Normalization (Min-Max Scaling)
    • Formula: (x - min(x)) / (max(x) - min(x))
    • Scales features to a fixed range, typically [0, 1]
  • Standardization (Z-score Scaling)
    • Formula: (x - mean(x)) / std(x)
    • Transforms data to have zero mean and unit variance
  • Log transformation
    • Useful for right-skewed distributions
    • Can help in making multiplicative relationships additive
  • Power transformation (Box-Cox, Yeo-Johnson)
    • Generalization of the log transformation
    • Can correct both positive and negative skewness
    • Box-Cox requires strictly positive inputs; Yeo-Johnson also handles zeros and negatives
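
The transformations above map one-to-one to scikit-learn; a minimal sketch on synthetic right-skewed data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

X = np.random.lognormal(mean=0.0, sigma=1.0, size=(100, 1))  # right-skewed

x_minmax = MinMaxScaler().fit_transform(X)             # scales to [0, 1]
x_std = StandardScaler().fit_transform(X)              # zero mean, unit variance
x_boxcox = PowerTransformer(method="box-cox").fit_transform(X)   # positive data only
x_yeo = PowerTransformer(method="yeo-johnson").fit_transform(X)  # handles any sign
```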

Encoding Categorical Variables

  • One-Hot Encoding
    • Creates binary columns for each category
    • Can lead to high dimensionality for variables with many categories
  • Label Encoding
    • Assigns a unique integer to each category
    • Suitable for ordinal variables
  • Target Encoding
    • Replaces categories with the mean target value for that category
    • Can lead to overfitting if not done carefully

Tip: For high-cardinality categorical variables, consider using embedding techniques or dimensionality reduction methods before encoding.
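
A small illustration of one-hot and target encoding from the list above (toy data; in practice, compute target means on training folds only to avoid the overfitting noted above):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"],
                   "target": [1, 0, 1, 0]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Target encoding: replace each category with its mean target value
means = df.groupby("city")["target"].mean()
df["city_te"] = df["city"].map(means)
```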

Feature Engineering

Feature Creation

  • Domain-specific features
    • Leverage expert knowledge to create meaningful features
    • Example: In finance, creating technical indicators from price data
  • Interaction features
    • Capturing relationships between multiple features
    • Example: Multiplying 'height' and 'width' to get 'area'
  • Polynomial features
    • Creating higher-order terms of existing features
    • Useful for capturing non-linear relationships
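
A sketch of interaction and polynomial features with scikit-learn, reusing the height/width example from above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [4.0, 5.0]])  # e.g. height, width

# degree=2 adds squares plus the pairwise interaction (height * width)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(["height", "width"]))
# ['height' 'width' 'height^2' 'height width' 'width^2']
```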

Feature Selection

  • Filter methods
    • Correlation-based: Pearson correlation, mutual information
    • Statistical tests: Chi-squared test, ANOVA
    • Variance threshold: Removing low-variance features
  • Wrapper methods
    • Recursive Feature Elimination (RFE)
    • Forward/Backward feature selection
  • Embedded methods
    • Lasso regularization for linear models
    • Feature importance in tree-based models
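
A minimal sketch combining a filter method with a wrapper method in scikit-learn (synthetic data and thresholds chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Filter method: drop near-constant features first
X_filtered = VarianceThreshold(threshold=0.01).fit_transform(X)

# Wrapper method: recursively eliminate the weakest features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_filtered, y)
print(rfe.support_)  # boolean mask of the selected features
```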

Automated Feature Engineering

  • Featuretools for automated feature engineering
    • Deep feature synthesis
    • Stacking and clustering of features
  • AutoML platforms
    • H2O.ai AutoML
    • TPOT (Tree-based Pipeline Optimization Tool)

Tip: Use automated feature engineering as a starting point, but always validate and interpret the generated features. Combine automated methods with domain knowledge for best results.

Machine Learning Algorithms

Linear Models

  • Linear Regression
    • Assumptions: linearity, independence of errors, homoscedasticity, normality of residuals
    • Regularization: Ridge (L2), Lasso (L1), Elastic Net (L1 + L2)
  • Logistic Regression
    • Binary and multinomial classification
    • Interpretation: Log odds and odds ratios
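
A quick sketch of the regularized linear models above in scikit-learn (alpha values are arbitrary starting points, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can drive some to exactly zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
print((lasso.coef_ == 0).sum(), "coefficients zeroed by Lasso")
```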

Tree-based Models

  • Decision Trees
    • Splitting criteria: Gini impurity, Information gain
    • Pruning techniques to prevent overfitting
  • Random Forests
    • Bagging + Random feature subset at each split
    • Out-of-bag (OOB) error estimation
  • Gradient Boosting Machines
    • XGBoost, LightGBM, CatBoost
    • Key parameters: learning rate, number of estimators, tree depth
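
A short sketch of out-of-bag error estimation with a scikit-learn random forest (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# oob_score=True uses the out-of-bag samples as a built-in validation set
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```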

Support Vector Machines (SVM)

  • Linear SVM vs Kernel SVM
  • Kernel tricks: RBF, Polynomial, Sigmoid
  • Soft margin classification and C parameter

k-Nearest Neighbors (k-NN)

  • Choice of k and its impact
  • Distance metrics: Euclidean, Manhattan, Minkowski
  • Weighted k-NN for improved performance

Naive Bayes

  • Gaussian NB, Multinomial NB, Bernoulli NB
  • Assumption of feature independence
  • Laplace smoothing for zero-frequency problem

Tip: For large datasets, start with fast algorithms like Naive Bayes or tree-based models for quick baselines before moving to more complex models.
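
A sketch of such a quick-baseline loop in scikit-learn (models and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Compare cheap baselines before investing in complex models
for model in (GaussianNB(), DecisionTreeClassifier(max_depth=5, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, f"{scores.mean():.3f}")
```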

Model Evaluation and Validation

Metrics for Classification

  • Accuracy, Precision, Recall, F1-Score
    • When to use each metric
    • Macro vs Micro vs Weighted averaging for multi-class problems
  • ROC AUC and PR AUC
    • Interpretation and when to prefer PR AUC over ROC AUC
  • Cohen's Kappa: Accounting for chance agreement
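
A minimal sketch computing these classification metrics with scikit-learn (the labels and probabilities below are invented for illustration):

```python
from sklearn.metrics import (classification_report, cohen_kappa_score,
                             roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]  # predicted P(class=1)

print(classification_report(y_true, y_pred))      # precision, recall, F1 per class
print("ROC AUC:", roc_auc_score(y_true, y_prob))  # needs scores, not labels
print("Kappa:  ", cohen_kappa_score(y_true, y_pred))
```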

Metrics for Regression

  • Mean Squared Error (MSE) and Root MSE (RMSE)
  • Mean Absolute Error (MAE)
  • R-squared and Adjusted R-squared
  • Mean Absolute Percentage Error (MAPE)

Cross-Validation Techniques

  • K-Fold Cross-Validation
    • Choosing the right k
    • Repeated k-fold for more robust estimates
  • Stratified K-Fold
    • Maintaining class distribution in each fold
  • Time Series Cross-Validation
    • Forward chaining
    • Sliding window approaches

Tip: Use nested cross-validation when you're doing both model selection and performance estimation to avoid overfitting to your validation set.
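
A minimal sketch of nested cross-validation in scikit-learn: the inner GridSearchCV tunes hyperparameters, while the outer loop estimates generalization performance (data and grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop selects C; outer loop scores the whole selection procedure
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f}")
```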

Hyperparameter Tuning

Grid Search

  • Exhaustive search over specified parameter values
  • Computationally expensive for large parameter spaces

Random Search

  • Random sampling from parameter distributions
  • Often more efficient than grid search for high-dimensional spaces

Bayesian Optimization

  • Sequential model-based optimization
  • Efficient for expensive-to-evaluate functions
  • Popular libraries: Hyperopt, Optuna
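
A minimal sketch of Bayesian-style hyperparameter search with Optuna (assumes the optuna package is installed; the search ranges are illustrative, not recommendations):

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def objective(trial):
    # Optuna's sampler proposes values from these ranges trial by trial
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 10)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```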

Advanced Techniques

  • Genetic Algorithms
    • Evolutionary approach to hyperparameter optimization
  • Particle Swarm Optimization
    • Inspired by the social behavior of bird flocks

Tip: Start with a coarse random search to identify promising regions of the parameter space, then refine with a focused Bayesian optimization.

Ensemble Methods

Bagging (Bootstrap Aggregating)

  • Random Forests
    • Feature importance and proximity analysis
  • Bagging meta-estimator in scikit-learn

Boosting

  • AdaBoost
    • Adaptive boosting algorithm
  • Gradient Boosting
    • XGBoost: Regularized gradient boosting
      • Key parameters: max_depth, min_child_weight, subsample
    • LightGBM: Gradient boosting with GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling)
      • Leaf-wise growth vs level-wise growth
    • CatBoost: Handling categorical features effectively

Stacking

  • Creating a meta-learner
  • Tips for effective stacking:
    • Use diverse base models
    • Use out-of-fold predictions for training the meta-learner

Tip: In competitions, focus on creating diverse models for your ensemble. In production, consider the trade-off between performance gain and increased complexity/maintenance cost.
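
A sketch of the stacking tips above using scikit-learn's StackingClassifier; with cv=5, the meta-learner is trained on out-of-fold predictions (base models chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # out-of-fold predictions reduce leakage into the meta-learner
)
stack.fit(X, y)
```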

Dimensionality Reduction

Linear Methods

  • Principal Component Analysis (PCA)
    • Explained variance ratio for selecting number of components
    • Incremental PCA for large datasets
  • Linear Discriminant Analysis (LDA)
    • Supervised method that considers class labels

Non-linear Methods

  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
    • Perplexity parameter and its impact
    • Limitations: Non-deterministic, computationally expensive
  • UMAP (Uniform Manifold Approximation and Projection)
    • Often preserves global structure better than t-SNE
    • Parameters: n_neighbors, min_dist

Tip: Use PCA as a first step to reduce dimensionality before applying t-SNE or UMAP, especially for very high-dimensional data.
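
A sketch of the PCA-then-UMAP recipe from the tip above (assumes the umap-learn package; random data stands in for real high-dimensional features):

```python
import numpy as np
import umap  # pip install umap-learn
from sklearn.decomposition import PCA

X = np.random.rand(1000, 500)  # stand-in for high-dimensional data

# PCA first to denoise and speed up the non-linear step
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)

# Then UMAP for a 2-D embedding
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                      random_state=0).fit_transform(X_pca)
```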

Handling Imbalanced Data

Resampling Techniques

  • Random Oversampling and Undersampling
    • Pros and cons of each approach
  • SMOTE (Synthetic Minority Over-sampling Technique)
    • Creating synthetic examples in feature space
  • ADASYN (Adaptive Synthetic Sampling)
    • Focuses on difficult-to-learn examples

Algorithm-level Approaches

  • Class weighting
    • Adjusting sample weights in inverse proportion to class frequencies
  • Focal Loss
    • Down-weights the loss assigned to well-classified examples
  • Anomaly detection algorithms
    • One-class SVM, Isolation Forest for extreme imbalance

Tip: Combine resampling with ensemble methods like Random Forests or Gradient Boosting for robust performance on imbalanced datasets.
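
A sketch combining SMOTE with a random forest, as the tip suggests (assumes the imbalanced-learn package; its Pipeline resamples only the training folds during cross-validation, avoiding leakage):

```python
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(f"F1: {scores.mean():.3f}")
```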

Interpretable Machine Learning

Model-specific Interpretability

  • Feature importance in tree-based models
    • Gini importance vs permutation importance
  • Coefficients in linear models
    • Standardizing features for coefficient comparison

Model-agnostic Methods

  • SHAP (SHapley Additive exPlanations) values
    • Game theoretic approach to feature importance
    • TreeSHAP for efficient computation with tree-based models
  • LIME (Local Interpretable Model-agnostic Explanations)
    • Explaining individual predictions
    • Limitations and potential instability
  • Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots
    • Visualizing feature effects on predictions

Tip: Combine global interpretability methods (like SHAP) with local explanations (like LIME) for a comprehensive understanding of your model's behavior.
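
A minimal SHAP sketch for a tree-based model (assumes the shap package; a regressor is used here because the return shape of shap_values for classifiers varies across shap versions):

```python
import shap  # pip install shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeSHAP computes exact Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)
shap.summary_plot(shap_values, X)       # global view of feature effects
```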

Deployment and Production

Model Serialization

  • Pickle for Python objects
    • Security concerns with unpickling
  • joblib for efficient storage of objects containing large NumPy arrays
  • ONNX for interoperability between frameworks
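
A minimal serialization sketch with joblib (toy model; the file path is illustrative):

```python
import joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0], [1]], [0, 1])

joblib.dump(model, "model.joblib")      # efficient for NumPy-heavy objects
restored = joblib.load("model.joblib")  # only load files you trust:
                                        # unpickling can execute arbitrary code
```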

API Development

  • Flask for Python-based APIs
    • RESTful API design principles
  • FastAPI for high-performance APIs
    • Automatic API documentation with Swagger UI
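
A minimal FastAPI prediction endpoint as a sketch (the model path and feature schema below are hypothetical); the Swagger UI docs mentioned above are auto-generated at /docs:

```python
# Run with: uvicorn app:app --reload
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical path to a trained model

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```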

Monitoring and Maintenance

  • Logging predictions and model performance
    • Tools: MLflow, Weights & Biases
  • Handling concept drift
    • Statistical methods for drift detection
    • Adaptive learning techniques
  • Model retraining strategies
    • Periodic retraining vs trigger-based retraining

Containerization and Orchestration

  • Docker for containerizing ML applications
  • Kubernetes for orchestrating containerized applications
  • Kubeflow for end-to-end ML pipelines on Kubernetes

Tip: Implement a robust monitoring system that tracks not just model performance, but also data distribution shifts and system health metrics.

Advanced Topics

Automated Machine Learning (AutoML)

  • AutoML platforms: H2O.ai, Auto-Sklearn, TPOT
  • Neural Architecture Search (NAS) for deep learning

Meta-Learning

  • Learning to learn across tasks
  • Few-shot learning techniques

Causal Inference in Machine Learning

  • Potential outcomes framework
  • Causal forests and causal boosting

Online Learning and Incremental Learning

  • Algorithms that can learn from streaming data
  • Handling concept drift in online settings

Federated Learning

  • Collaborative learning while keeping data decentralized
  • Challenges: Communication efficiency, privacy preservation

Tip: Stay updated with these advanced topics, but always evaluate their practical applicability to your specific problems and constraints.

Best Practices and Tips

  1. Start with a clear problem definition and success metrics.

    • Engage stakeholders to understand the business impact of your model.
    • Define quantifiable metrics that align with business goals.
  2. Establish a robust cross-validation strategy early on.

    • Ensure your validation strategy reflects the real-world use case of your model.
    • For time series data, use time-based splits rather than random splits.
  3. Build a strong baseline model before moving to complex algorithms.

    • Simple models provide a benchmark and help in understanding the problem.
    • Often, a well-tuned simple model can outperform a poorly tuned complex model.
  4. Version control your data, code, and models.

    • Use tools like DVC (Data Version Control) alongside Git.
    • Document your experiments thoroughly, including failed attempts.
  5. Regularly communicate results and insights to stakeholders.

    • Use visualization tools to make your results accessible to non-technical stakeholders.
    • Be transparent about your model's limitations and uncertainties.
  6. Keep up with the latest research, but be critical of new methods.

    • Implement new techniques only if they provide tangible benefits over existing methods.
    • Reproduce key results from papers to truly understand new methods.
  7. Participate in machine learning competitions to sharpen your skills.

    • Platforms like Kaggle provide real-world datasets and challenging problems.
    • Learn from top performers' solutions and share your own insights.
  8. Collaborate and share knowledge with the ML community.

    • Contribute to open-source projects.
    • Write blog posts or give talks about your experiences and learnings.
  9. Always consider the ethical implications of your ML models.

    • Assess potential biases in your data and models.
    • Consider the broader societal impact of your ML applications.
  10. Continuously learn and adapt to new tools and techniques in the field.

    • Set aside time for learning and experimentation.
    • Build a diverse skill set that includes statistical knowledge, programming, and domain expertise.
  11. Optimize your workflow and automate repetitive tasks.

    • Create reusable code modules for common tasks.
    • Use MLOps tools to streamline your ML pipeline.
  12. Pay attention to data quality and provenance.

    • Implement data quality checks at various stages of your pipeline.
    • Maintain detailed metadata about your datasets and their sources.
  13. Design your models with interpretability in mind from the start.

    • Choose inherently interpretable models when possible.
    • Incorporate explanation methods into your model development process.
  14. Regularly reassess and update your models in production.

    • Implement A/B testing for model updates.
    • Monitor for concept drift and retrain models when necessary.
    • Set up automated alerts for significant performance degradation.
  15. Optimize for both model performance and computational efficiency.

    • Profile your code to identify bottlenecks.
    • Consider using approximate algorithms for large-scale problems.
    • Leverage distributed computing frameworks for big data processing.
  16. Invest time in feature engineering and selection.

    • Combine domain expertise with data-driven approaches.
    • Use feature importance techniques to focus on the most impactful features.
    • Regularly reassess feature relevance as new data becomes available.
  17. Implement robust error handling and logging.

    • Anticipate and handle edge cases in your data preprocessing and model inference.
    • Set up comprehensive logging to facilitate debugging and auditing.
    • Use exception handling to gracefully manage runtime errors.
  18. Prioritize reproducibility in your work.

    • Use fixed random seeds for reproducible results.
    • Document your entire experimental setup, including hardware specifications.
    • Consider using containerization to ensure consistent environments.
  19. Balance model complexity with interpretability and maintainability.

    • Consider the long-term costs of maintaining complex models.
    • Use model compression techniques if deploying in resource-constrained environments.
    • Prioritize interpretable models for high-stakes decisions.
  20. Stay aware of the limitations of your models and data.

    • Clearly communicate the assumptions and constraints of your models.
    • Be cautious about extrapolating beyond the range of your training data.
    • Regularly validate your model's performance on out-of-sample data.
  21. Foster a culture of continuous improvement and learning.

    • Encourage experimentation and learning from failures.
    • Set up regular knowledge-sharing sessions within your team.
    • Stay connected with the broader ML community through conferences and meetups.
  22. Consider the end-user experience when designing ML systems.

    • Design intuitive interfaces for interacting with your models.
    • Provide clear explanations of model outputs and confidence levels.
    • Gather and incorporate user feedback to improve your models and interfaces.
  23. Implement proper security measures for your ML pipeline.

    • Protect sensitive data used in training and inference.
    • Be aware of potential adversarial attacks and implement defenses.
    • Regularly audit your ML systems for security vulnerabilities.
  24. Develop a systematic approach to hyperparameter tuning.

    • Start with a broad search and gradually refine.
    • Use Bayesian optimization for efficient exploration of hyperparameter space.
    • Keep detailed records of hyperparameter experiments and their results.
  25. Embrace uncertainty in your predictions.

    • Provide confidence intervals or prediction intervals when possible.
    • Use techniques like Monte Carlo Dropout for uncertainty estimation in neural networks.
    • Communicate the reliability and limitations of your predictions to stakeholders.

Remember, mastering machine learning is a journey of continuous learning and experimentation. The field is rapidly evolving, and staying current with new techniques and best practices is crucial. However, always balance the adoption of new methods with a critical evaluation of their practical benefits for your specific problems.

As you apply these techniques and best practices, you'll develop a nuanced understanding of when and how to use different approaches. Trust your intuition, but always validate it with empirical evidence. Don't be afraid to challenge conventional wisdom or to propose novel solutions to problems.

Lastly, always keep in mind the ethical implications of your work in machine learning. As ML systems increasingly impact people's lives, it's our responsibility as practitioners to ensure that these systems are fair, transparent, and beneficial to society.

Happy learning and may your models be ever accurate and your insights profound!