Machine Learning using PySpark
-
Fundamentals of Spark & Machine Learning: Understanding Big Data, Hadoop & Spark. Core data structures like RDDs and DataFrames. Data types for machine learning like Vectors & Matrices. Black-box introduction to machine learning. Understanding the machine learning pipeline & its different stages: data ingestion, streaming, wrangling, visualization, preprocessing, model training, validation & deployment.
-
Data Wrangling & Visualization: Using DataFrames to understand, clean & summarize data. Data merging, duplicate deletion, statistical data analysis. Using matplotlib to visualize information.
-
Data Pre-processing: Numerical data scaling & normalization. Dealing with categorical data. Handling images. OneHotEncoder, VectorAssembler. Everything required to get data ready for machine learning. Dealing with text: TF-IDF, CountVectorizer, HashingVectorizer.
-
Feature Selection & Extraction: Selecting the important feature columns when working with the large datasets Spark deals with. VectorSlicer, RFormula, Correlation, ChiSqSelector, PCA, SVD.
-
Linear Models for Classification & Regression: Understanding linear models like linear regression, logistic regression and regularized regression. Intuition about how distributed learning works. Problem solving using these models.
-
Spark Pipeline, Grid Search, Model Validation & Persistence: Connecting transformers with estimators in a pipeline. Hyper-parameter tuning using grid search. Persisting models. Cross-validation for finding the best model.
-
Naive Bayes, Trees & Ensemble Methods: Fundamentals of Naive Bayes and decision trees. Understanding ensemble learning methods like RandomForest and GBT. Understanding the distributed implementation of these algorithms. Problem solving using these algorithms.
-
Clustering: Unsupervised learning, clustering, Bisecting KMeans, Gaussian Mixture Models, LDA. Customer segmentation using clustering methods.
-
Recommendation Engine: Content-based recommendation, collaborative filtering, the cold-start problem, distance vectors for product similarity, the ALS model.
-
Deep Learning in Spark: Understanding the perceptron. Understanding deep neural networks. Introduction to TensorFlow. Deep Learning Pipeline on Spark. TensorFlow on Spark.
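
To give a feel for the text pre-processing topic listed above (TF-IDF on top of term counts), here is a minimal plain-Python sketch of the computation. It is not Spark code and not taken from the course notebooks: the function name `tf_idf` and the toy corpus are hypothetical, and the smoothed formula `log((N + 1) / (df + 1))` is the one documented for Spark MLlib's IDF. In Spark the same steps would run in distributed form over a DataFrame.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (illustrative helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts for this document
        weighted.append({t: count * idf[t] for t, count in tf.items()})
    return weighted


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["spark", "makes", "big", "data", "simple"],
    ["spark", "runs", "machine", "learning", "at", "scale"],
    ["big", "data", "needs", "distributed", "learning"],
]
weights = tf_idf(corpus)
```

In this toy corpus "spark" occurs in two of the three documents, so it gets a lower weight in the first document than "makes", which occurs in only one.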