
Data Science

Next Batch Starts on 04th Oct 2018

The course aims to develop both the math and programming skills required of a data scientist. It gives insight into the data analysis problems that arise in business verticals and into solving those problems using statistical and machine learning approaches. The course also focuses on understanding the fundamental math underlying those models. It is more a practical, research-oriented course than a developer-oriented one. It focuses on the 6 most common data analysis problems that arise in most business verticals: Classification, Regression, Recommender Systems, Clustering, Association Analysis and Outlier Detection.

Category: Data Analytics

  • Details
  • Objectives
  • Target Audience
  • Syllabus

Technical Details

Duration: 120 Hours
Prerequisites: Working knowledge of any object-oriented programming language
Classroom Course: Available
Video Course: Unavailable

Objectives

Upon successful completion of the Data Science/Analytics course, participants will be able to:

  • Understand and apply statistical data analysis techniques as they are used in business decision making
  • Understand and apply machine learning techniques in business data analysis
  • Solve a data analysis use case on their own, from inception to deployment
  • Apply algorithms to build machine intelligence

Target Audience

People in any of the following categories:

  • Developers at all levels
  • BI professionals
  • Data Warehousing professionals
  • Team Leads
  • Analytics Managers & Business Managers

Syllabus

  • 1. Introduction to Data Science/Analytics
  • Why do companies care about Data Scientists/Analysts?
  • Common myths & confusions: Data Analyst, Business Analyst, BI, Data Miner, ML Engineer, Data Scientist, etc.
  • What is Data Science? Why Data Science?
  • Data driven product engineering
  • Skill set of a Data Scientist and how to become one
  • Who is hiring? Career Opportunities
  • 3. Technology overview for Data Science/Analytics
  • Detailed lifecycle for solving Data Science problems
  • Technologies for Data Science/Analytics
  • Languages: R/Python/Julia/Scala
  • Frameworks & packages for structured data
  • Frameworks & packages for structured big data
  • Frameworks & packages for unstructured (big) data
  • Datasets for doing data science/analytics
  • 5. Applied Linear Algebra for data scientists
  • Applied perspective of Linear Algebra
  • Vector Algebra
  • ideas that map to vectors
  • understanding vector operations
  • understanding linear independence
  • Matrix Algebra
  • ideas that map to matrices
  • fundamental ideas in matrix algebra:
  • matrix operations
  • determinant, eigenvalues and eigenvectors, inverse, rank
  • positive definiteness & semi-definiteness; basis, orthogonal and orthonormal bases
  • understanding factorization
  • SVD factorization
  • (Optional) LU factorization
  • (Optional) QR factorization
  • 7. Applied Probability for data scientists
  • Applied perspective of Probability theory
  • Basic Probability, Conditional Probability
  • Bayes Rule/Reasoning, MAP vs MLE Reasoning
  • Mapping Random process to Random variable
  • Properties of Random variables, expectation, variance, entropy and cross-entropy, covariance and correlation
  • Understanding standard random processes
  • Probability Distributions: Normal, Gamma, Poisson, Dirichlet, Bernoulli, Binomial, Power law, Log-normal, Multinomial
  • Parameter Estimation in Distributions: MAP and MLE approaches
  • 9. Applied Optimization theory for data scientists
  • Applied perspective of optimization theory
  • Non-ML vs ML optimization problems
  • Modelling ML problems with optimization requirements
  • Solving unconstrained optimization problems
  • Solving optimization problems with linear constraints
  • Gradient descent variations
  • Batch vs stochastic gradients
  • 11. Introduction to Machine Learning
  • Pattern discovery: Manual vs Automated
  • Supervised ML
  • Unsupervised ML
  • Reinforcement ML
  • Overfitted vs underfitted models
  • Techniques to mitigate overfitting
  • 13. Exploration of structured data (EDA)
  • pandas support for data sources & formats
  • working with CSV & TSV files
  • working with RDBMS data
  • working with JSON files
  • pandas dataframes
  • creating dataframes from different sources & formats
  • query with dataframe API
  • visual EDA
  • univariate plots: bar chart, histogram, boxplot, density curve
  • multi-variate plots: facetgrids, factorplots, scatterplots
  • t-SNE plots
  • 15. Feature Engineering
  • Feature filtering techniques
  • Variance based filtering
  • Correlation based filtering
  • Feature Creation
  • Techniques to create new features
  • Feature Selection
  • Statistical feature selection
  • Model based feature selection
  • Feature Extraction & Transformation
  • PCA
  • 17. Usecase driven approach to solve Classification problem
  • Applied perspective of classification problem
  • Machine learning approaches to solve classification problem
  • Tree learning approaches
  • Algorithms: CART, C4.5, C5.0
  • Overfitting control techniques: Prepruning, Cost complexity pruning, Pessimistic pruning
  • Probabilistic learning approaches
  • Algorithms: NaiveBayes
  • Overfitting control techniques
  • Objective based learning approaches
  • Algorithms: SVM, Logistic Regression, Neural Network
  • Overfitting control techniques: Lasso, Ridge & Elastic net penalties
  • Instance based learning: KNN
  • Ensemble based learning
  • Algorithms: Voting, Stacking, RandomForest, AdaBoost, Gradient boosting, Extreme gradient boosting (XGB)
  • Overfitting control techniques
  • Evaluation Metrics for Classification Algorithms
  • Confusion matrix
  • Accuracy, Error Rate
  • Precision, Recall and F-Score
  • ROC curve, AUC
  • 19. Usecase driven approach to Recommendation problem
  • Applied perspective of recommendation problem
  • Top-N Recommender
  • Rating Prediction
  • Machine learning approaches to solve recommendation problem
  • Content based learning approaches
  • Collaborative filtering (KNN based) approaches
  • Algorithms: UBCF, IBCF
  • Overfitting control techniques
  • Latent factor learning approaches
  • Algorithms: Funk algorithm
  • Overfitting control techniques: Lasso, Ridge & Elastic net penalties
  • Hybrid learning approaches
  • Evaluation Metrics for Recommendation Algorithms
  • Top-N Recommender: Accuracy, Error Rate
  • Rating Prediction: RMSE
  • 21. Usecase driven approach to Clustering problem
  • Applied perspective of clustering problem
  • Machine learning approaches to solve clustering problem
  • Iterative algorithms: K-means, K-medoids
  • Hierarchical algorithms: Ward's algorithm
  • Density based algorithms: DBSCAN
  • BIRCH algorithm
  • Evaluation Metrics for Clustering Algorithms
  • Ground-truth (GT) based metrics: Adjusted Rand Index, Mutual Information
  • No-GT based metrics: Silhouette Coefficient, Calinski-Harabasz Index
  • 23. Usecase driven approach to Dimensionality Reduction problem
  • Applied perspective of feature reduction problem
  • Machine learning approaches to reduce dimensionality
  • Variance based approach
  • Algorithms: Linear PCA, Non-linear PCA
  • Evaluation: Captured Variance
  • Neighborhood based approach
  • Algorithms: t-SNE
  • Evaluation: KL-divergence
  • 2. Data Analysis Problems/Usecases across business verticals
  • Predictive Analytics Problems: Classification, Regression & Recommenders
  • Descriptive Analytics Problems: Frequent Pattern Mining, Clustering, Outlier Detection & Dimensionality Reduction
  • Types of Data: Structured, Time-Series, Text, Image, Voice and Video data
  • Business Verticals: Retail, Banking, Financial, Social, Web, Medical, Scientific, Logistics, Real Estate, etc.
  • 4. Mastering Python Language
  • Why Python for data science?
  • Distributions for Python
  • IDE for Python
  • Data structure support in Python: list, set, dictionary, tuple, dataframe, array & collections package
  • 3 paradigms in Python: imperative, object-oriented, functional styles
  • Functional programming in Python
  • Package/module creation in Python
  • Debugging Python programs
  • 6. Applied Statistics for data scientists
  • Applied perspective of statistics
  • Descriptive stats for single variable
  • mean, median, mode, quantiles, percentiles, standard deviation, variance, MAD, IQR
  • Descriptive stats for two variables, covariance, correlation
  • Hypothesis Testing
  • Statistical Tests
  • chi-square test
  • t-test
  • ANOVA
  • Inferential Statistics
  • 8. Applied Calculus for data scientists
  • Applied perspective of calculus
  • Rate of change
  • Concept of limit
  • Concept of derivative
  • Partial derivatives & gradient
  • Significance of gradient
  • Concept of integration
  • Applications of calculus
  • 10. Applied Information theory for data scientists
  • Applied perspective of information theory
  • Entropy, conditional entropy
  • Gini-index, conditional gini-index
  • Mutual Information
  • 12. Detailed lifecycle for solving data science problems
  • Inception stage for business requirements
  • What & how to map requirements as data science problems?
  • EDA
  • Data preprocessing phase
  • Feature engineering phase
  • Model building & tuning phase
  • Model evaluation phase
  • Model selection phase
  • Model deployment phase
  • Setting up a Flask server, building RESTful ML services, deploying ML services to Flask
  • Integration of ML services
  • ML pipelines
  • 14. Data Preprocessing
  • Type conversions
  • Data normalization
  • Handling skewed data
  • Handling missing data
  • Level matching of categorical features
  • 16. Model Evaluation & Tuning
  • Model evaluation techniques
  • Resubstitution approach
  • Resampling approach
  • Repeated Holdout
  • k-fold CV
  • Repeated k-fold CV
  • Bootstrapping
  • Parameter tuning
  • GridSearch
  • RandomSearch
  • BayesianSearch
  • 18. Usecase driven approach to solve Regression problem
  • Applied perspective of regression problem
  • Machine learning approaches to solve regression problem
  • Tree learning approaches
  • Algorithms: CART
  • Overfitting control techniques: Prepruning, Cost complexity pruning, Pessimistic pruning
  • Objective based learning approaches
  • Algorithms: SVM, Linear Regression, Neural Network
  • Overfitting control techniques: Lasso, Ridge & Elastic net penalties
  • Instance based learning: KNN
  • Ensemble based learning
  • Algorithms: Voting, Stacking, RandomForest, Gradient boosting, Extreme gradient boosting (XGB)
  • Overfitting control techniques
  • Evaluation Metrics for Regression Algorithms
  • RMSE (Root Mean Squared Error)
  • MAD (Mean Absolute Deviation)
  • R^2
  • 20. Usecase driven approach to Frequent Pattern Mining & Association Analysis
  • Applied perspective of FP-Mining & Association Analysis
  • Machine learning algorithms to solve FP-Mining & Association Analysis
  • Apriori, Eclat, FP-Growth
  • Evaluation Metrics for Frequent Pattern Mining
  • Support, Confidence & Lift
  • 22. Usecase driven approach to Outlier-Detection problem
  • Applied perspective of outlier-detection problem
  • Machine learning approaches to detect outliers
  • Robust PCA
  • Isolation Forest
  • Local Outlier Factor (LOF)
  • Evaluation Metrics for Outlier Detection Algorithms
  • ROC curve & AUC
  • 24. Project (2-day Hackathon)
  • Hackathon (Day 1)
  • Hackathon (Day 2)