
BigData Analytics with Spark

Previous Batch Started on 25th Aug 2018

Modern applications generate BIGDATA whose processing and storage demands cannot be met by single-machine systems. Distributed computing and storage technologies are mandatory for dealing with such massive data. This course is targeted at understanding BIGDATA and its storage and processing demands, including query, machine learning & streaming jobs on bigdata.

Category: Data Analytics

  • Details
  • Objectives
  • Target Audience
  • Syllabus

Technical Details

Duration: 30-35 Hours
Prerequisites: Working knowledge of Python and of the datascience course (ML algorithms)
Class Room Course: Available
Video Course: Unavailable

Upon successful completion of the Big Data Analytics with Spark course, participants will be able to:

  • Understand the issues with BIGDATA processing and storage
  • Understand the need for distributed processing and storage
  • Use SparkSQL for querying and reporting
  • Solve bigdata science problems using SparkML
  • Solve streaming problems using Spark Streaming

Target Audience

People in any of the following categories:

  • Developers at all levels
  • Team Leads & Technical Managers
Syllabus

  • 1. BIGDATA storage & processing
  • Origins of BIGDATA
  • BIGDATA storage
  • Issues with storing bigdata
  • Frameworks to solve BIGDATA storage: HDFS, S3, NoSQL stores
  • BIGDATA processing
  • Issues with processing bigdata
  • Frameworks to solve BIGDATA processing: MapReduce, Spark engine
  • Stream processing
  • Issues with stream processing
  • Frameworks to solve stream processing: Storm, Spark Streaming
  • Hadoop vs Spark
  • Hadoop/Spark usecases: Data Consolidation, ETL, BIGDATA query and reporting, DataScience Platform, Streaming Platform
  • Vendor comparison (Cloudera, Hortonworks, MapR, Databricks)
  • 2. Spark Overview
  • Custom setup vs cloud (PaaS) service
  • Spark components & ecosystem
  • Core datastructures: RDD vs Dataframe
  • Spark API: RDD-based API vs Dataframe-based API
  • RDD fundamentals
  • Spark commands vs Spark applications (see the sketch below)
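As an illustration (not part of the official course material), here is a minimal sketch of a standalone PySpark application contrasting the RDD and DataFrame APIs; the app name and sample words are invented:

    # minimal PySpark application: an illustrative sketch
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("overview-sketch").getOrCreate()
    sc = spark.sparkContext

    # RDD API: low-level functional transformations on raw records
    rdd = sc.parallelize(["spark", "hadoop", "spark", "storm"])
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())

    # DataFrame API: schema-aware, optimized by the Catalyst planner
    df = spark.createDataFrame([("spark",), ("hadoop",), ("spark",)], ["word"])
    df.groupBy("word").count().show()

    spark.stop()

The same logic can be typed interactively in the pyspark shell (Spark commands) or saved to a file and launched with spark-submit (a Spark application).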
  • 3. Spark Job Execution
  • Spark clustering architecture
  • Jobs, Stages & Tasks
  • Job Tracking
  • Partitions & Shuffling
  • Data Locality
  • Caching support (see the sketch below)
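A small sketch of how partitions, shuffles and caching surface in code; the row count and partition count are arbitrary:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("execution-sketch").getOrCreate()

    df = spark.range(0, 1000000)         # a one-column DataFrame of ids
    print(df.rdd.getNumPartitions())     # partitions = units of parallel tasks

    # groupBy forces a shuffle: rows move between partitions across stages
    buckets = df.groupBy((df.id % 10).alias("bucket")).count()
    buckets.explain()                    # inspect stages & exchanges in the plan

    # Caching keeps a reused dataset in memory instead of recomputing it
    cached = df.repartition(8).cache()
    cached.count()                       # first action materializes the cache
    cached.count()                       # later actions read from memory

    spark.stop()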
  • 4. Exploration of structured BIGDATA with Spark (EDA)
  • Spark support for BIGDATA storage sources & formats
  • BIGDATA storage frameworks: HDFS, S3, Cassandra, HBase
  • Spark support to work on different storage sources
  • Working with CSV, TSV files
  • Working with JSON files
  • Working with ORC & Parquet format data
  • Spark dataframes
  • Creating dataframes from different sources & formats
  • Querying with the dataframe API and SQL (see the sketch below)
  • Caching & reusing dataframes
  • Visual EDA
  • Univariate plots: barchart, histogram, boxplot, density curve
  • Multi-variate plots: facet grids, factor plots, scatterplots
  • t-SNE plots
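A sketch of reading several formats and running the same query through both the DataFrame API and SparkSQL; the HDFS paths and the column names (region, amount) are placeholders, not course data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("eda-sketch").getOrCreate()

    # Reading different sources & formats (paths are placeholders)
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
    events = spark.read.json("hdfs:///data/events.json")
    history = spark.read.parquet("hdfs:///data/sales_history.parquet")

    sales.cache()                        # reuse across several EDA queries

    # Same aggregation, two styles: DataFrame API ...
    sales.groupBy("region").sum("amount").show()

    # ... and SparkSQL over a temporary view
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()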
  • 5. Detailed lifecycle for solving bigdata science problems
  • Inception stage for business requirements
  • What & how to map requirements to datascience problems?
  • EDA
  • Data preprocessing phase
  • Feature engineering phase
  • Model building & tuning phase
  • Model evaluation phase
  • Model selection phase
  • Model deployment phase
  • Setting up a Flask server
  • Building RESTful ML services (see the sketch below)
  • Deploying ML services to Flask
  • Integration of ML services
  • ML pipelines
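A minimal sketch of the deployment idea: wrap a trained, serialized model in a RESTful Flask endpoint. The file name model.pkl, the /predict route and the feature layout are assumptions for illustration, not the course's exact service:

    # score_service.py: a hypothetical RESTful ML scoring service
    import pickle
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # A previously trained model, saved with pickle (file name is assumed)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]   # e.g. [5.1, 3.5, 1.4, 0.2]
        prediction = model.predict([features])[0]
        return jsonify({"prediction": float(prediction)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

A client would then call, for example: curl -X POST http://localhost:5000/predict -H 'Content-Type: application/json' -d '{"features": [5.1, 3.5, 1.4, 0.2]}'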
  • 6. Data Preprocessing
  • Type conversions
  • Data normalization
  • Handling skewed data
  • Handling missing data (see the sketch below)
  • Level matching of categorical features
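A sketch of three common preprocessing steps in PySpark; the toy rows and column names are invented for illustration:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("a", "10", 1000.0), ("b", None, 50.0), ("a", "30", None)],
        ["category", "age", "income"])

    # Type conversion: string column to integer
    df = df.withColumn("age", F.col("age").cast("int"))

    # Missing data: fill gaps with constants (mean imputation is another option)
    df = df.na.fill({"age": 0, "income": 0.0})

    # Skewed data: log-transform a heavy-tailed column
    df = df.withColumn("log_income", F.log1p("income"))

    df.show()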
  • 7. Feature Engineering
  • Feature filtering techniques
  • Variance-based filtering
  • Correlation-based filtering
  • Feature Creation
  • Techniques to create new features
  • Feature Selection
  • Statistical feature selection
  • Model-based feature selection
  • Feature Extraction & Transformation
  • PCA (see the sketch below)
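A sketch of feature extraction with Spark ML: assemble raw columns into a single vector, then project with PCA. The toy values and column names are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, PCA

    spark = SparkSession.builder.appName("features-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 10.0)],
        ["x1", "x2", "x3"])

    # Spark ML expects the inputs assembled into one vector column
    assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
    assembled = assembler.transform(df)

    # PCA as feature extraction: keep the top-2 principal components
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(assembled)
    print(model.explainedVariance)       # captured variance per component
    model.transform(assembled).select("pca_features").show(truncate=False)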
  • 8. Model Evaluation & Tuning
  • Model evaluation techniques
  • Resubstitution approach
  • Resampling approach
  • Repeated Holdout
  • k-fold CV
  • Repeated k-fold CV
  • Bootstrapping
  • Parameter tuning
  • GridSearch (see the sketch below)
  • RandomSearch
  • BayesianSearch
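A sketch of grid search with k-fold cross-validation in Spark ML; the synthetic rows and grid values are arbitrary:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()
    rows = [(Vectors.dense([float(i % 2), float((i + 1) % 2)]), float(i % 2))
            for i in range(40)]
    train = spark.createDataFrame(rows, ["features", "label"])

    lr = LogisticRegression()
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    # 3-fold cross-validation over every point of the parameter grid
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),  # AUC
                        numFolds=3)
    model = cv.fit(train)
    print(model.avgMetrics)              # mean AUC per grid point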
  • 9. BIGDATA (structured) predictive analysis with Spark
  • Predictive Analytics Problems
  • Classification
  • Regression
  • Recommenders
  • Supervised ML support in Spark
  • Usecase-driven approach to solving a Classification problem (see the sketch below)
  • Algorithms: DecisionTree, RandomForest, GBTree, Logistic Regression, NeuralNetwork, NaiveBayes
  • Metrics: Accuracy, AUC, F-Score
  • Usecase-driven approach to solving a Regression problem
  • Algorithms: LinearRegression, Generalized LinearRegression, DecisionTree, RandomForest, GBTree
  • Metrics: RMSE, R^2
  • Usecase-driven approach to solving a Recommendation problem
  • Algorithms: Collaborative Filtering
  • Metrics
  • Top-N Recommender: Accuracy, Error Rate
  • Rating Prediction: RMSE
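A use-case-shaped sketch: train a RandomForest classifier and score it with AUC and accuracy. Synthetic toy rows stand in for a real use case:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                       MulticlassClassificationEvaluator)
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("classification-sketch").getOrCreate()
    rows = [(Vectors.dense([float(i % 2), float(i % 3)]), float(i % 2))
            for i in range(60)]
    df = spark.createDataFrame(rows, ["features", "label"])
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    model = RandomForestClassifier(numTrees=20).fit(train)
    pred = model.transform(test)

    # AUC from the binary evaluator, accuracy from the multiclass one
    auc = BinaryClassificationEvaluator().evaluate(pred)
    acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(pred)
    print("AUC:", auc, "accuracy:", acc)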
  • 10. BIGDATA (structured) descriptive analysis with Spark
  • Descriptive Analytics Problems
  • Clustering
  • Dimensionality Reduction
  • FP-Mining & Association Analysis
  • Outlier Detection
  • Unsupervised ML support in Spark
  • Usecase-driven approach to solving a Clustering problem (see the sketch below)
  • Algorithms: K-Means, LDA, GMM
  • Metrics
  • GT-based metrics: Adjusted Rand Index, Mutual Information
  • No-GT-based metrics: Silhouette Coefficient, Calinski-Harabasz Index
  • Usecase-driven approach to solving a Dimensionality Reduction problem
  • Algorithms: PCA
  • Metrics: captured variance
  • Usecase-driven approach to solving an FP-Mining & Association Analysis problem
  • Algorithms: FP-Growth
  • Metrics: Support, Confidence & Lift
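A sketch of the no-ground-truth evaluation path: cluster with K-Means and score the assignment with the Silhouette Coefficient. The toy points are chosen to form two obvious clusters:

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("clustering-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
         (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.2]),)],
        ["features"])

    model = KMeans(k=2, seed=1).fit(df)
    pred = model.transform(df)

    # Silhouette: cluster quality without ground-truth labels
    print("silhouette:", ClusteringEvaluator().evaluate(pred))
    print("centers:", model.clusterCenters())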
  • 11. (optional) Stream processing with Spark (see the sketch below)
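A minimal word-count sketch with Spark Structured Streaming (one of Spark's streaming APIs); it assumes a text source on a local socket, e.g. started with nc -lk 9999:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Unbounded stream of text lines from a local socket
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    # Running word count over the stream
    words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")   # re-emit full counts on every trigger
             .format("console")
             .start())
    query.awaitTermination()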
  • 12. (optional) Data collection techniques & tools
  • 13. Project