Machine Learning using PySpark

Awantik Das edited this page Oct 8, 2018 · 3 revisions
  1. Fundamentals of Spark & Machine Learning
    Understanding Big Data, Hadoop & Spark. Core data structures such as RDDs and DataFrames. Distributed data types for machine learning such as Vectors & Matrices. Black-box introduction to machine learning. Understanding the machine learning pipeline & its stages: data ingestion, streaming, wrangling, visualization, preprocessing, model training, validation & deployment.

  2. Data Wrangling & Visualization
    Using DataFrames to understand, clean & summarize data. Merging data, removing duplicates, statistical data analysis. Visualizing information with matplotlib.

  3. Data Pre-processing
    Scaling & normalizing numerical data. Dealing with categorical data. Handling images. OneHotEncoder, VectorAssembler. Everything required to get data ready for machine learning. Dealing with text: TF-IDF, CountVectorizer, HashingVectorizer.

  4. Feature Selection & Extraction
    Spark deals with large datasets, so selecting the important feature columns matters. VectorSlicer, RFormula, Correlation, ChiSqSelector, PCA, SVD.

  5. Linear Models for Classification & Regression
    Understanding linear models such as linear regression, logistic regression & regularized regression. Intuition about how distributed learning works. Problem solving using these models.

  6. Spark Pipeline, GridSearch, Model Validation & Persistence
    Connecting transformers with estimators in a pipeline. Hyper-parameter tuning using GridSearch. Persisting models. Cross-validation for finding the best model.

  7. Naive Bayes, Trees & Ensemble Methods
    Fundamentals of Naive Bayes & decision trees. Understanding ensemble learning methods such as RandomForest & GBT. Understanding the distributed implementation of these algorithms. Problem solving using these algorithms.

  8. Clustering
    Unsupervised Learning, Clustering, Bisecting KMeans, Gaussian Mixture Models, LDA. Customer segmentation using clustering methods

  9. Recommendation Engine
    Content-based recommendation, collaborative filtering, the cold-start problem, distance vectors for product similarity, ALS model.

  10. Deep Learning in Spark
    Understanding the perceptron. Understanding deep neural networks. Introduction to TensorFlow. Deep Learning Pipeline on Spark. TensorFlow on Spark.
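
Topic 5 mentions intuition about how distributed learning works. One way to picture it is the plain-Python sketch below. It is not Spark code and does not claim to mirror Spark's internals. Each partition computes a partial gradient on its own rows, and a driver loop sums the partials before updating the model. The names `partition_gradient` and `fit_linear`, the toy data, and the learning rate are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[float, float]  # (feature x, label y)


def partition_gradient(rows: List[Row], w: float, b: float) -> Tuple[float, float, int]:
    """Sum the squared-error gradients over one partition (the "map" step)."""
    grad_w = 0.0
    grad_b = 0.0
    for x, y in rows:
        err = (w * x + b) - y
        grad_w += err * x
        grad_b += err
    return grad_w, grad_b, len(rows)


def fit_linear(partitions: List[List[Row]], lr: float = 0.05, epochs: int = 2000) -> Tuple[float, float]:
    """Gradient descent for y = w*x + b using only per-partition partial sums."""
    w = 0.0
    b = 0.0
    for _ in range(epochs):
        # Each partition could run on a different worker; here it is a plain loop.
        partials = [partition_gradient(rows, w, b) for rows in partitions]
        # The driver combines the partial sums (the "reduce" step).
        grad_w = sum(p[0] for p in partials)
        grad_b = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data following y = 2x + 1, split across three hypothetical partitions.
data = [(float(x), 2.0 * x + 1.0) for x in range(6)]
partitions = [data[0:2], data[2:4], data[4:6]]
w, b = fit_linear(partitions)
print(round(w, 3), round(b, 3))
```

Because the gradient is a sum over rows, the result does not depend on how the rows are split. Only the small partial sums travel back to the driver, never the rows themselves.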
